Aviation Week Data Pipeline - Pratt & Whitney


Pratt & Whitney – Aviation Week Utilization Pipeline Integration

To enhance operational visibility and aircraft performance analytics, Pratt & Whitney developed an external data ingestion pipeline to integrate Aviation Week utilization data into its enterprise Azure-based data lake.

Pipeline Architecture Overview:

1. Data Extraction with Apache Spark:

  • Aviation Week provides usage statistics and asset performance data through RESTful APIs and flat file dumps.
  • A Spark-based extraction engine was developed to ingest large volumes of structured and semi-structured data (CSV, JSON) from the source system.
  • Spark jobs run in Databricks notebooks or on a standalone Spark cluster and normalize the incoming data to match Pratt & Whitney’s bronze layer schema; a minimal sketch of such a job is shown below.
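
A minimal PySpark sketch of this extraction step follows. The landing-zone paths, column names, and Delta output location are illustrative assumptions rather than the actual Pratt & Whitney configuration.

    # Illustrative PySpark extraction job: paths and column names are assumed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("aviation_week_extract").getOrCreate()

    # Flat-file dumps (CSV) and API exports (JSON) from an assumed landing zone.
    csv_df = spark.read.option("header", True).csv("/landing/aviation_week/csv/")
    json_df = spark.read.json("/landing/aviation_week/json/")

    # Normalize both sources to a common bronze-layer shape (columns assumed).
    bronze_df = (
        csv_df.select(
            F.col("tail_number"),
            F.col("flight_hours").cast("double"),
            F.col("cycles").cast("int"),
            F.to_timestamp("event_ts").alias("event_ts"),
        )
        .unionByName(
            json_df.select(
                F.col("tailNumber").alias("tail_number"),
                F.col("flightHours").cast("double").alias("flight_hours"),
                F.col("cycles").cast("int"),
                F.to_timestamp("eventTimestamp").alias("event_ts"),
            )
        )
        .withColumn("ingest_date", F.current_date())
    )

    # Land the normalized records in the bronze layer (Delta format assumed, as on Databricks).
    (bronze_df.write.mode("append")
        .partitionBy("ingest_date")
        .format("delta")
        .save("/mnt/bronze/aviation_week_utilization"))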

2. Real-Time Stream Ingestion with Kafka:

  • Spark writes the transformed data to Apache Kafka topics, decoupling downstream consumers and enabling real-time processing.
  • A custom Kafka enrichment service computes two derived fields (a sketch follows this list):
      • “Aging values”: the delta between the event timestamp and the current date, used for trend analysis.
      • “Snapshot values”: the state of aircraft utilization captured at predefined intervals (daily/hourly) to build a historical change log.
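
A simplified sketch of such an enrichment service is shown below, using the kafka-python client. The topic names, broker address, field names, and interval tagging are assumptions made for illustration, and event timestamps are assumed to arrive as ISO-8601 strings.

    # Illustrative Kafka enrichment loop: topics, fields, and broker are assumed.
    import json
    from datetime import datetime, timezone
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "aviation_week.utilization.raw",                      # assumed source topic
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:
        record = message.value
        event_ts = datetime.fromisoformat(record["event_ts"])
        if event_ts.tzinfo is None:
            event_ts = event_ts.replace(tzinfo=timezone.utc)  # assume UTC if no offset

        now = datetime.now(timezone.utc)

        # Aging value: days elapsed between the event timestamp and the current date.
        record["aging_days"] = (now - event_ts).days

        # Snapshot value: tag each record with the daily/hourly interval it belongs to,
        # so downstream consumers can rebuild utilization state as of any point in time.
        record["snapshot_date"] = now.strftime("%Y-%m-%d")
        record["snapshot_hour"] = now.strftime("%H")

        producer.send("aviation_week.utilization.enriched", value=record)  # assumed sink topic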

3. Workflow Orchestration with Apache Airflow:

  • All ingestion, transformation, and quality checks are orchestrated through Apache Airflow.
  • DAGs schedule Spark jobs, Kafka producers, and downstream processing at controlled intervals (e.g., hourly for streaming, daily for batch snapshots).
  • Airflow also triggers alerting in case of ingestion failures or SLA breaches (an illustrative DAG is sketched below).
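
An Airflow DAG for the daily batch path might look like the following sketch. The DAG id, schedule, job commands, SLA, and retry settings are assumptions, not the production definitions.

    # Illustrative Airflow DAG: task names, schedule, and commands are assumed.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-platform",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "sla": timedelta(hours=1),        # SLA misses surface through Airflow alerting
        "email_on_failure": True,
    }

    with DAG(
        dag_id="aviation_week_daily_batch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",       # daily batch snapshots; hourly streaming runs in a separate DAG
        catchup=False,
        default_args=default_args,
    ) as dag:
        extract = BashOperator(
            task_id="spark_extract",
            bash_command="spark-submit /jobs/aviation_week_extract.py",            # assumed job path
        )
        publish = BashOperator(
            task_id="kafka_publish",
            bash_command="python /jobs/publish_utilization_to_kafka.py",           # assumed producer script
        )
        validate = BashOperator(
            task_id="great_expectations_checkpoint",
            bash_command="great_expectations checkpoint run aviation_week_bronze", # assumed checkpoint name
        )

        extract >> publish >> validate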

4. Data Quality Validation with Great Expectations:

  • Each batch of incoming data is passed through a Great Expectations checkpoint before landing in the bronze layer.
  • Validations include the following (sketched below):
      • Schema conformity
      • Null checks on critical fields (tail number, flight hours)
      • Distribution checks on numeric fields (e.g., cycles per flight)
  • Validation results are stored and visualized, with failed records quarantined in a separate path in the data lake.
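
The snippet below gives a flavor of the checks such a checkpoint enforces, using Great Expectations’ pandas convenience API (available in classic, pre-1.0 versions of the library) for brevity rather than a full checkpoint configuration. Column names, value bounds, and the batch path are illustrative assumptions.

    # Illustrative Great Expectations checks: columns, bounds, and path are assumed.
    import great_expectations as ge
    import pandas as pd

    batch = pd.read_parquet("/landing/aviation_week/latest_batch.parquet")  # assumed batch location
    ge_batch = ge.from_pandas(batch)

    # Schema conformity: the batch must carry exactly the expected bronze columns.
    ge_batch.expect_table_columns_to_match_ordered_list(
        ["tail_number", "flight_hours", "cycles", "event_ts"]
    )

    # Null checks on critical fields.
    ge_batch.expect_column_values_to_not_be_null("tail_number")
    ge_batch.expect_column_values_to_not_be_null("flight_hours")

    # Distribution check on a numeric field (bounds are placeholders).
    ge_batch.expect_column_values_to_be_between("cycles", min_value=0, max_value=50)

    results = ge_batch.validate()
    if not results.success:
        # In the real pipeline, the failing batch would be routed to the quarantine path.
        print("Validation failed:", results.statistics)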

Outcome:

  • Enabled near real-time and historical analysis of third-party aircraft utilization data.
  • Improved fleet readiness analytics, maintenance forecasting, and integration with internal engine performance dashboards.
  • Established a reusable external ingestion framework for additional aerospace industry datasets.

