Cloudera/Hortonworks Migration to Azure Databricks

Migration from Cloudera/Hortonworks to Azure Databricks for Pratt & Whitney, incorporating a bridging technique for the landing layer, AWS EventBridge for ingestion, and data quality and anomaly detection measures.




Pratt & Whitney – Data Lake Migration from Hortonworks (HDP) to Azure Databricks



Project Title:

Modernizing Pratt & Whitney’s Enterprise Data Lake – Hortonworks to Azure Databricks





Background:



As part of its digital transformation journey, Pratt & Whitney aimed to retire its legacy on-premise Hortonworks Data Platform (HDP) cluster, which was becoming increasingly costly to maintain and lacked elasticity for modern analytics needs. The strategic decision was made to migrate to Azure Databricks to leverage cloud scalability, Delta Lake reliability, and native ML/AI integration.





Key Objectives:




  • Migrate historical and active workloads from Hortonworks (HDP) to Azure Databricks.
  • Design a bridging strategy to synchronize landing zone data during the transition.
  • Implement an ingestion framework leveraging AWS EventBridge for real-time and batch ingestion.
  • Establish a Data Quality Framework including anomaly detection for data reliability and governance.






Migration Strategy Overview:




1. Data Lake Zones Re-Architecture:



Redesigned the data lake layers using the medallion architecture on Azure:



  • Landing (Raw): Replication of on-premise ingestions.
  • Bronze: Cleaned ingested data, including file validation and metadata enrichment.
  • Silver: Curated and joined data for analytical access.
  • Gold: Aggregated data for business-specific reporting and ML use cases.
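
As an illustration of the bronze-layer step (file validation and metadata enrichment), here is a minimal sketch in plain Python. The allowed extensions, the metadata field names, and the function names are assumptions made for this example; they are not taken from the actual project.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical allow-list of file types accepted from the landing zone.
ALLOWED_EXTENSIONS = (".csv", ".json", ".parquet")


def validate_landing_file(name: str, content: bytes) -> list[str]:
    """Return a list of validation problems; an empty list means the file passes."""
    problems = []
    if not name.lower().endswith(ALLOWED_EXTENSIONS):
        problems.append("unsupported extension")
    if len(content) == 0:
        problems.append("empty file")
    return problems


def enrich_metadata(name: str, content: bytes, source_system: str) -> dict:
    """Build the metadata record attached to a file when it is promoted to bronze."""
    return {
        "file_name": name,
        "source_system": source_system,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```

Files that return a non-empty problem list would stay in the landing zone for review, while files that pass carry their metadata record into bronze.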




2. Bridging Technique for Landing Layer:



To minimize disruption and ensure data continuity:



  • A bi-directional sync bridge was built between the legacy Hadoop landing directory (HDFS) and Azure Data Lake Storage Gen2 using DistCp and Azure Data Factory pipelines.
  • Kafka Connect buffered change data capture (CDC) events for real-time sync between the HDFS and cloud landing zones.
  • A cutover window was defined, during which both systems were synchronized and reconciled for data integrity using hash-based row-level comparisons.
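
The hash-based row-level comparison used during the cutover window can be sketched roughly as follows. This is a simplified, in-memory illustration; the key column, the sample rows, and the function names are hypothetical, and a real reconciliation at this scale would more likely run as a distributed job.

```python
import hashlib


def row_hash(row: dict) -> str:
    """Hash a row deterministically: sort the columns, then join name=value pairs."""
    canonical = "\x1f".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key):
    """Compare two row sets by key and report missing, unexpected, and changed rows."""
    src = {r[key]: row_hash(r) for r in source_rows}
    tgt = {r[key]: row_hash(r) for r in target_rows}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Hypothetical sample: rows read from HDFS versus rows read from ADLS Gen2.
hdfs_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
adls_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
report = reconcile(hdfs_rows, adls_rows, key="id")
```

An empty report on all three lists is the signal that both landing zones agree and the cutover can proceed.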






Ingestion Framework Using AWS EventBridge:



Although the target platform is Azure-based, Pratt & Whitney utilized AWS EventBridge to trigger and orchestrate ingestion events from its distributed sensor and application systems (historically hosted on AWS):



Architecture Highlights:

  • AWS EventBridge captures data availability events from S3 and external sensors.
  • Events are passed to a Lambda-based bridge that places payloads into Azure Storage Queues or Event Grid.
  • Azure Functions consume these triggers to kick off Databricks Jobs or ADF Pipelines for downstream ingestion into the bronze layer.
  • This provided a hybrid event-driven ingestion mechanism, bridging cloud ecosystems during the transition period.
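
The Lambda-based bridge described above might look something like the sketch below. It assumes the incoming event follows the EventBridge S3 "Object Created" layout (bucket name and object key under `detail`). The payload fields and the `send` parameter are hypothetical; `send` stands in for whatever client call places the message on an Azure Storage Queue or Event Grid topic.

```python
import json


def build_ingestion_message(event: dict) -> dict:
    """Translate an EventBridge S3 'Object Created' event into a queue payload."""
    detail = event["detail"]
    return {
        "event_id": event["id"],
        "event_time": event["time"],
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "target_layer": "bronze",  # assumption: all events feed the bronze layer
    }


def handler(event, context, send=print):
    """Lambda-style entry point.

    `send` is a placeholder for the real Azure client call; it defaults to
    print so the sketch runs without any cloud SDK installed.
    """
    message = build_ingestion_message(event)
    send(json.dumps(message))
    return message
```

Keeping the translation in its own function makes it testable without AWS or Azure credentials, and the downstream Azure Function only needs to parse the JSON payload to decide which Databricks Job or ADF Pipeline to start.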






Data Quality & Anomaly Detection:



A modular data quality framework was implemented to monitor the integrity of data during and after ingestion.



Key Dimensions Measured:




  • Completeness: Row-count checks against expected volumes and null checks on required fields.
  • Accuracy: Rule-based validation using domain-specific thresholds.
  • Timeliness: SLAs on file delivery and latency between event capture and ingestion.
  • Uniqueness: Duplicate detection using hash-based keys.
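
To make the four dimensions concrete, here is a minimal sketch of one check per dimension in plain Python. The column names (`engine_id`, `egt`), the thresholds, and the function names are assumptions chosen for illustration only.

```python
import hashlib
from datetime import datetime, timedelta


def completeness(rows, required_columns, expected_rows):
    """Compare the row count to the expected count and count rows with nulls."""
    rows_with_nulls = sum(
        1 for row in rows if any(row.get(col) is None for col in required_columns)
    )
    return {"row_count_ok": len(rows) == expected_rows, "rows_with_nulls": rows_with_nulls}


def accuracy_violations(rows, column, low, high):
    """Count non-null values that fall outside the domain threshold [low, high]."""
    return sum(
        1 for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )


def within_sla(event_time, ingest_time, sla):
    """True when the capture-to-ingestion latency is within the SLA."""
    return ingest_time - event_time <= sla


def duplicate_count(rows, key_columns):
    """Count rows whose hashed business key has already been seen."""
    seen, duplicates = set(), 0
    for row in rows:
        raw_key = "\x1f".join(str(row.get(col)) for col in key_columns)
        digest = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        seen.add(digest)
    return duplicates
```

Each function returns a plain count or flag, so the results can be written to a metrics table and rolled up into a periodic quality score.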




Anomaly Detector:




  • Built on Azure Machine Learning and integrated with Databricks Delta tables.
  • Applied unsupervised techniques (Isolation Forest, z-score thresholds, autoencoders) to time-series sensor and transactional datasets.
  • Detected:
      • Missing sequences in batch loads
      • Outliers in engine sensor readings
      • Unexpected volume surges in operational data

  • Alerting integrated with Azure Monitor and Teams/Slack channels for near real-time notification.
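
Two of the simpler detections listed above, z-score outliers and missing batch sequences, can be sketched in a few lines of standard-library Python. This is only an illustration; the threshold of 3.0 and the function names are assumptions, and the Isolation Forest and autoencoder models are not shown here.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indexes of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def missing_sequences(batch_ids):
    """Return the batch numbers absent between the smallest and largest id seen."""
    present = set(batch_ids)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

Note that a z-score check needs a reasonably long window: with very few points, no single value can exceed a threshold of 3.0, which is one reason to pair it with model-based detectors on sparse data.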






Outcomes:




  • Successfully migrated 120+ TB of data with less than one hour of downtime.
  • Reduced data ingestion latency by 40% via event-driven architecture.
  • Raised the data quality score from 72% to 95% (measured monthly).
  • Enabled new ML pipelines for predictive maintenance leveraging curated gold data.


