Pratt & Whitney – Aviation Week Utilization Pipeline Integration
To enhance operational visibility and aircraft performance analytics, Pratt & Whitney developed an external data ingestion pipeline to integrate Aviation Week utilization data into its enterprise Azure-based data lake.
Pipeline Architecture Overview:
1. Data Extraction using Apache Spark:
- Aviation Week provides usage statistics and asset performance data through RESTful APIs and flat file dumps.
- A Spark-based extraction engine was developed to ingest large volumes of structured and semi-structured data (CSV, JSON) from the source system.
- Spark jobs run in Databricks notebooks or on a standalone Spark cluster and normalize the incoming data to match Pratt & Whitney’s bronze layer schema.
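The normalization step can be illustrated with a minimal plain-Python sketch of the per-record logic such a Spark job might apply. The bronze column names and the source field aliases below are hypothetical, since the actual schema is not described here, and a real job would express the same renaming and casting as Spark DataFrame operations.

```python
import csv
import io
import json

# Hypothetical bronze-layer columns; the real schema is not given in this document.
BRONZE_COLUMNS = ("tail_number", "flight_hours", "cycles", "event_ts")

# Hypothetical mapping from source field names to bronze column names.
FIELD_ALIASES = {
    "TailNumber": "tail_number",
    "tail_no": "tail_number",
    "FlightHours": "flight_hours",
    "hours": "flight_hours",
    "Cycles": "cycles",
    "EventTimestamp": "event_ts",
    "timestamp": "event_ts",
}


def normalize(record):
    """Rename known fields, drop unknown ones, and cast numeric columns to float."""
    row = {col: None for col in BRONZE_COLUMNS}
    for key, value in record.items():
        target = FIELD_ALIASES.get(key, key)
        # Empty strings (common in CSV) are treated as missing values.
        if target in row and value not in ("", None):
            row[target] = value
    for col in ("flight_hours", "cycles"):
        if row[col] is not None:
            row[col] = float(row[col])
    return row


def read_csv(text):
    """Parse CSV text with a header row into normalized bronze rows."""
    return [normalize(r) for r in csv.DictReader(io.StringIO(text))]


def read_json_lines(text):
    """Parse newline-delimited JSON into normalized bronze rows."""
    return [normalize(json.loads(line)) for line in text.splitlines() if line.strip()]
```

Both readers return rows with the same set of columns, so downstream steps do not need to know whether a record arrived as CSV or JSON.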
2. Real-Time Stream Ingestion with Kafka:
- Spark writes transformed data to Apache Kafka topics, enabling decoupling of downstream consumers and real-time processing.
- A custom Kafka enrichment service computes:
  - “Aging values”: the time elapsed between each event's timestamp and the current date, used for trend analysis.
  - “Snapshot values”: the state of aircraft utilization captured at predefined intervals (daily or hourly) to build a historical change log.
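The two enrichment values can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the service's actual code: aging is taken to be measured in whole days, a snapshot is taken to be the latest event per aircraft within each interval, and the field names `tail_number` and `event_ts` are hypothetical.

```python
from datetime import datetime, timezone


def aging_days(event_ts, now):
    """Whole days elapsed between the event timestamp and `now`."""
    return (now - event_ts).days


def snapshot_key(event_ts, interval="daily"):
    """Truncate a timestamp to the start of its daily or hourly interval."""
    if interval == "daily":
        return event_ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if interval == "hourly":
        return event_ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported interval: {interval}")


def build_snapshots(events, interval="daily"):
    """Keep the latest event per (tail_number, interval start)."""
    latest = {}
    for event in events:
        key = (event["tail_number"], snapshot_key(event["event_ts"], interval))
        if key not in latest or event["event_ts"] > latest[key]["event_ts"]:
            latest[key] = event
    return latest


# Example call with the current UTC time as the reference point.
example_age = aging_days(datetime(2024, 1, 1, tzinfo=timezone.utc), datetime.now(timezone.utc))
```

Keeping one record per aircraft and interval is what turns a stream of events into a change log that can be compared day over day or hour over hour.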
3. Workflow Orchestration with Apache Airflow:
- All ingestion, transformation, and quality checks are orchestrated through Apache Airflow.
- DAGs schedule Spark jobs, Kafka producers, and downstream processing at fixed intervals (e.g., hourly for the jobs that feed the stream, daily for batch snapshots).
- Airflow also triggers alerting mechanisms in case of ingestion failures or SLA breaches.
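The dependency ordering and SLA checking that the orchestration layer provides can be illustrated without Airflow itself. The sketch below uses Python's standard `graphlib` module, not the Airflow API. The task names, their order, and the SLA threshold are all hypothetical, since the actual DAG definitions are not given here.

```python
from datetime import timedelta
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_with_spark": set(),
    "publish_to_kafka": {"extract_with_spark"},
    "enrich_aging_and_snapshots": {"publish_to_kafka"},
    "validate_batch": {"enrich_aging_and_snapshots"},
    "load_bronze": {"validate_batch"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


def sla_breaches(durations, sla):
    """Return the names of tasks whose run time exceeded the SLA, sorted by name."""
    return sorted(task for task, took in durations.items() if took > sla)


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    # Hypothetical run times checked against a hypothetical one-hour SLA.
    print(sla_breaches({"extract_with_spark": timedelta(minutes=75)}, timedelta(hours=1)))
```

In an Airflow deployment the same relationships would be declared between operators in a DAG file, and the breach list would feed whatever alerting channel is configured.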
4. Data Quality Validation with Great Expectations:
- Each batch of incoming data is passed through a Great Expectations checkpoint before landing in the bronze layer.
- Validations include:
  - Schema conformity
  - Null checks on critical fields (tail number, flight hours)
  - Distribution checks on numeric fields (e.g., cycles per flight)
- Validation results are stored and visualized for review, and records that fail validation are quarantined in a separate S3 path rather than loaded into the bronze layer.
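The kinds of checks listed above, and the split between accepted and quarantined records, can be sketched in plain Python. This is a stand-in for the logic only and does not use the Great Expectations API. The expected column set, the numeric bounds, and the field name `cycles_per_flight` are hypothetical.

```python
from statistics import mean

REQUIRED_FIELDS = ("tail_number", "flight_hours")  # critical fields named in the text
EXPECTED_COLUMNS = {"tail_number", "flight_hours", "cycles_per_flight"}  # hypothetical schema
CYCLES_RANGE = (0.0, 10.0)  # hypothetical plausible range for cycles per flight


def check_record(record):
    """Return a list of failure reasons; an empty list means the record passed."""
    failures = []
    if set(record) != EXPECTED_COLUMNS:
        failures.append("schema_mismatch")
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            failures.append(f"null_{field}")
    cycles = record.get("cycles_per_flight")
    if cycles is not None and not (CYCLES_RANGE[0] <= cycles <= CYCLES_RANGE[1]):
        failures.append("cycles_out_of_range")
    return failures


def batch_mean_ok(records, field, low, high):
    """Batch-level distribution check: the mean of `field` must lie in [low, high]."""
    values = [r[field] for r in records if r.get(field) is not None]
    return bool(values) and low <= mean(values) <= high


def split_batch(records):
    """Separate records that pass every check from those to be quarantined."""
    passed, quarantined = [], []
    for record in records:
        failures = check_record(record)
        if failures:
            quarantined.append({"record": record, "failures": failures})
        else:
            passed.append(record)
    return passed, quarantined
```

Storing the failure reasons alongside each quarantined record makes it possible to report on why data was rejected, not only how much.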
Outcome:
- Enabled near real-time and historical analysis of third-party aircraft utilization data.
- Improved fleet readiness analytics, maintenance forecasting, and integration with internal engine performance dashboards.
- Established a reusable external ingestion framework for additional aerospace industry datasets.