Core competencies in Data Engineering

Two Data Engineers interviewed at FAANG for a Data Engineering role.


One got rejected.

One got hired.

Same interviews.


Different grasp of fundamental data concepts.


Meta tests foundations because much of its stack is proprietary.


You can't learn their tools before joining. But you can master the core principles that translate to any data system.


Here's the fundamental Data Engineering stack you need to master, no matter the company you're aiming for.


Layer 1: Data Modeling & Schema Design


The foundation everything builds on.


- Normalization vs denormalization tradeoffs.

- Star and snowflake schemas.

- Slowly changing dimensions.

- Partitioning and bucketing strategies.


Poor modeling? Your queries will never scale.
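To make slowly changing dimensions concrete, here is a minimal sketch of a Type 2 update in plain Python. The `CustomerRow` class, the `apply_scd2` helper, and the sample cities are all hypothetical names made up for illustration. In a real warehouse this logic would usually live in a SQL `MERGE` or a transformation tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[CustomerRow], customer_id: int,
               new_city: str, change_date: date) -> List[CustomerRow]:
    """Close the current row and append a new version if the city changed."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == new_city:
        return history  # nothing changed, keep history untouched
    if current is not None:
        current.valid_to = change_date  # close the old version
    history.append(CustomerRow(customer_id, new_city, change_date))
    return history


history: List[CustomerRow] = []
apply_scd2(history, 1, "Berlin", date(2023, 1, 1))
apply_scd2(history, 1, "Berlin", date(2023, 6, 1))  # same value, no new row
apply_scd2(history, 1, "Munich", date(2024, 3, 1))  # change, new version
```

The old row is never overwritten, only closed. That is what lets you answer "where did this customer live on a given date?" later.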


Layer 2: SQL & Query Optimization


Your primary language for data.


- Complex joins and window functions.

- Query execution plans and indexes.

- Subquery vs CTE performance.

- Aggregation optimization techniques.


Can't write efficient SQL? You won't pass the technical round.
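A small sketch of two of the bullets above, using Python's built-in `sqlite3` module so it runs anywhere. The `orders` table, its rows, and the index name are invented for illustration. It assumes a SQLite build with window function support (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2024-01-01", 10.0),
        ("alice", "2024-01-05", 25.0),
        ("bob", "2024-01-02", 40.0),
        ("bob", "2024-01-03", 5.0),
    ],
)

# CTE + window function: pick the most recent order per customer.
latest_query = """
WITH ranked AS (
    SELECT customer, order_day, amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer ORDER BY order_day DESC
           ) AS rn
    FROM orders
)
SELECT customer, order_day, amount
FROM ranked
WHERE rn = 1
ORDER BY customer
"""
latest = conn.execute(latest_query).fetchall()

# Execution plan: after adding an index, ask SQLite how it would run a lookup.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE customer = ?",
    ("alice",),
).fetchall()
plan_details = [row[-1] for row in plan]  # last column holds the plan text
```

The exact plan wording differs between SQLite versions, and other engines have their own `EXPLAIN` output. The habit is the same everywhere: read the plan before you guess.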


Layer 3: Distributed Systems Fundamentals


How data systems actually work at scale.


- CAP theorem and consistency models.

- Partitioning and replication strategies.

- Distributed query processing.

- Fault tolerance and recovery.


Miss these concepts? You can't reason about production issues.
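Here is a rough sketch of hash partitioning with replication, the idea behind the second bullet above. The function names, node names, and the "next N nodes" placement rule are assumptions chosen for simplicity. Real systems often use consistent hashing or rack-aware placement instead.

```python
import hashlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replicas_for(partition: int, nodes: List[str],
                 replication_factor: int) -> List[str]:
    """Place a partition on a primary node plus the next nodes in the ring."""
    return [
        nodes[(partition + i) % len(nodes)]
        for i in range(replication_factor)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
partition = partition_for("user_42", num_partitions=len(nodes))
owners = replicas_for(partition, nodes, replication_factor=3)
```

With three copies on four nodes, losing any single node still leaves two replicas of every partition. That is the fault tolerance the last bullet refers to.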


Layer 4: Data Pipeline Architecture


Moving data reliably at scale.


- Batch vs streaming tradeoffs.

- Idempotency and exactly-once processing.

- Backfill strategies and data quality.

- Orchestration and dependency management.


Bad pipelines? Data teams lose trust in your work.
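One common way to get idempotency is to key every write on a stable ID, so a retry or a backfill overwrites instead of duplicating. The sketch below uses `sqlite3` from the standard library. The `events` table and the `load_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, user_id TEXT, amount REAL)"
)


def load_batch(connection, events):
    """Write a batch keyed on event_id; rerunning it leaves the same rows."""
    connection.executemany(
        "INSERT OR REPLACE INTO events (event_id, user_id, amount) "
        "VALUES (?, ?, ?)",
        events,
    )
    connection.commit()


batch = [
    ("evt-1", "alice", 10.0),
    ("evt-2", "bob", 5.0),
    ("evt-3", "alice", 7.5),
]

load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry or backfill of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
total_amount = conn.execute("SELECT SUM(amount) FROM events").fetchone()[0]
```

The second load changes nothing. Pair an idempotent sink like this with at-least-once delivery and the end result looks like exactly-once, even though messages may arrive twice.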


Layer 5: Storage Systems & Formats


Where and how you store matters.


- Row vs columnar storage tradeoffs.

- Parquet, ORC, Avro characteristics.

- Data lake vs warehouse patterns.

- Compression and encoding strategies.


Wrong storage choices kill query performance.
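A toy illustration of why columnar layouts help analytics, in plain Python. The sample rows and the `run_length_encode` helper are made up. Formats like Parquet and ORC apply the same ideas with far more sophistication.

```python
from itertools import groupby

# Row layout: each record is stored together.
rows = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 20},
    {"country": "DE", "amount": 5},
    {"country": "US", "amount": 7},
    {"country": "US", "amount": 3},
]

# Columnar layout: each field is stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Sorted, low-cardinality columns compress very well.
encoded_country = run_length_encode(columns["country"])

# An aggregate touches one column only, not every field of every row.
total_amount = sum(columns["amount"])
```

Five country values shrink to two pairs, and the sum never reads the `country` data at all. Row layouts win the other case: fetching or updating one whole record.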


Layer 6: Data Quality & Observability


Production data is messy.


- Schema validation and evolution.

- Data lineage and impact analysis.

- Monitoring pipeline health.

- SLA definition and alerting.


No observability? You're flying blind in production.
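Schema validation can start very small. This sketch checks each record against an expected set of fields and types. `EXPECTED_SCHEMA` and `validate_record` are hypothetical names. In production you would more likely rely on a schema registry or a dedicated validation library.

```python
EXPECTED_SCHEMA = {"order_id": int, "customer": str, "amount": float}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields the schema does not know about often signal upstream drift.
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


good = validate_record({"order_id": 1, "customer": "alice", "amount": 9.99})
bad = validate_record({"order_id": "1", "customer": "alice", "coupon": "X"})
```

Counting these errors per run and alerting when the count jumps is a simple first step toward pipeline monitoring.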


Layer 7: Performance & Scalability


The difference between junior and senior.


- Understanding data skew and hotspots.

- Memory vs disk tradeoffs.

- Caching strategies and materialization.

- Cost optimization techniques.


Can't optimize? Your pipelines won't survive scale.
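Data skew is easier to reason about once you see it. The sketch below counts rows per partition for a dataset with one hot key, then applies key salting, one common mitigation. The key names, the partition count, and the round-robin salt are illustrative assumptions.

```python
import hashlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Stable hash partitioning, the same for every run."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def load_per_partition(keys, num_partitions):
    """Count how many rows land on each partition."""
    return Counter(partition_of(key, num_partitions) for key in keys)


NUM_PARTITIONS = 8
NUM_SALTS = 8

# 90% of the rows share one key, so one partition does almost all the work.
keys = ["hot_user"] * 900 + [f"user_{i}" for i in range(100)]
unsalted = load_per_partition(keys, NUM_PARTITIONS)

# Salting: append a small suffix so the hot key spreads over several partitions.
salted_keys = [f"{key}#{i % NUM_SALTS}" for i, key in enumerate(keys)]
salted = load_per_partition(salted_keys, NUM_PARTITIONS)

worst_before = max(unsalted.values())
worst_after = max(salted.values())
```

Salting is not free. An aggregation over salted keys needs a second step that strips the suffix and combines the partial results.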
