Medallion layer management

Managing Medallion Architecture schemas (Bronze → Silver → Gold) dynamically using open-source tools is an efficient and scalable way to handle schema evolution, automation, and governance without relying on commercial tools.


The goal is a system that can:

  • read schema definitions dynamically from YAML or JSON
  • automatically generate and manage schemas and views
  • support schema evolution when new columns are added or removed
  • work consistently across all three layers (Bronze, Silver, and Gold)
  • remain fully open-source and version-controlled


A recommended open-source stack:



  • Schema storage: YAML or JSON with Git for versioning
  • Schema versioning: Liquibase or schemachange
  • Data modeling: dbt Core (open source)
  • Metadata and validation: Great Expectations or pydantic
  • Workflow orchestration: Apache Airflow or Dagster
  • Schema registry (optional): Apache Atlas or Amundsen for lineage tracking



For the Bronze layer, which handles raw ingestion, schemas can be inferred automatically from data sources such as Parquet, JSON, or CSV using Python libraries like pandas or pyarrow. The inferred schema is then stored as YAML for tracking and governance. Technologies such as Delta Lake or Apache Iceberg can be used to handle schema evolution and metadata automatically. The YAML file for Bronze would typically define the data source location, inference date, and the inferred schema fields.
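The Bronze inference step can be sketched in plain Python. In practice pyarrow or pandas would do the type inference and PyYAML would serialise the result; this dependency-free sketch uses a tiny heuristic and JSON output, and the source path is illustrative.

```python
import csv
import io
import json
from datetime import date

def infer_type(values):
    """Tiny type-inference heuristic: integer, then double, else string."""
    for cast, name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text, source_path):
    """Infer a Bronze schema from a CSV sample; the returned dict is
    ready to serialise as YAML or JSON for tracking in Git."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "source": source_path,
        "inferred_on": date.today().isoformat(),
        "fields": [
            {"name": col, "type": infer_type([r[col] for r in rows])}
            for col in rows[0]
        ],
    }

sample = "customer_id,name,balance\n1,Ada,10.5\n2,Grace,99.0\n"
print(json.dumps(infer_schema(sample, "s3://raw/customers/"), indent=2))
```

The same dict maps directly onto the source-location, inference-date, and fields sections of the Bronze YAML described above.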


In the Silver layer, which focuses on standardized and cleaned data, dynamic schema management comes into full use. YAML is used to define expected schemas, which can then be validated against the actual data before transformation. Great Expectations can enforce schema and data validation, while Jinja templating can be used to dynamically generate SQL that applies cleaning, casting, and normalization logic. For example, columns can be automatically cast to their expected types or cleaned using standard transformations defined in YAML. These transformations can then be executed using dbt Core or custom Python SQL runners.
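A minimal sketch of the Silver SQL generation, assuming a schema definition that would normally live in YAML (inlined here as a dict; table and column names are illustrative). A real pipeline would render this through a Jinja template, but plain string building shows the same casting-and-cleaning pattern.

```python
# Schema definition that would normally be loaded from a Silver YAML file.
SILVER_SCHEMA = {
    "source": "bronze.customers",
    "target": "silver.customers_clean",
    "columns": [
        {"name": "customer_id", "type": "BIGINT"},
        {"name": "name", "type": "VARCHAR", "clean": "TRIM"},
        {"name": "balance", "type": "DOUBLE"},
    ],
}

def render_silver_sql(schema):
    """Build the casting/cleaning SELECT for the Silver layer."""
    exprs = []
    for col in schema["columns"]:
        expr = f"CAST({col['name']} AS {col['type']})"
        if "clean" in col:  # wrap in a cleaning function such as TRIM
            expr = f"{col['clean']}({expr})"
        exprs.append(f"{expr} AS {col['name']}")
    body = ",\n  ".join(exprs)
    return (f"CREATE OR REPLACE VIEW {schema['target']} AS\n"
            f"SELECT\n  {body}\nFROM {schema['source']};")

print(render_silver_sql(SILVER_SCHEMA))
```

The generated statement can then be executed by dbt Core or a custom Python SQL runner, as described above.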


The Gold layer is designed for aggregations and business-level transformations. Here, you can use YAML-driven metric definitions, specifying KPIs, calculations, and grouping logic. SQL views or materialized views are generated dynamically from these YAML definitions, providing a curated, business-ready data layer. These gold views can also be refreshed periodically using orchestration tools like Airflow.
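The Gold generation step follows the same pattern. This sketch turns a YAML-driven metric definition (inlined as a dict; the view, source, and metric names are illustrative) into a CREATE VIEW statement.

```python
# Metric definitions that would normally live in a Gold YAML file.
GOLD_METRICS = {
    "view": "gold.customer_metrics",
    "source": "silver.orders_enriched",
    "group_by": ["customer_id"],
    "metrics": [
        {"name": "order_count", "expr": "COUNT(*)"},
        {"name": "total_spend", "expr": "SUM(amount)"},
    ],
}

def render_gold_view(defn):
    """Turn a YAML-driven metric definition into a CREATE VIEW statement."""
    selects = defn["group_by"] + [
        f"{m['expr']} AS {m['name']}" for m in defn["metrics"]
    ]
    return (f"CREATE OR REPLACE VIEW {defn['view']} AS\n"
            f"SELECT {', '.join(selects)}\n"
            f"FROM {defn['source']}\n"
            f"GROUP BY {', '.join(defn['group_by'])};")

print(render_gold_view(GOLD_METRICS))
```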


Schema evolution is managed by automatically detecting changes, such as new columns in the Bronze layer, and propagating them downstream where appropriate. Every schema update is versioned in Git or logged in a changelog YAML with details like version number, added or dropped columns, and timestamps. Open-source tools like schemachange, Liquibase, or pyiceberg help track and apply these schema changes across environments.
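The change-detection step can be sketched as a diff between the previously stored field list and the newly inferred one; the resulting entry is the shape you would append to a changelog YAML (the field names below are illustrative).

```python
from datetime import date

def diff_schemas(old_fields, new_fields):
    """Compare two schema field lists ({'name': ..., 'type': ...} dicts)
    and return a changelog entry ready to append to a changelog YAML."""
    old = {f["name"]: f["type"] for f in old_fields}
    new = {f["name"]: f["type"] for f in new_fields}
    return {
        "date": date.today().isoformat(),
        "added": sorted(set(new) - set(old)),
        "dropped": sorted(set(old) - set(new)),
        "retyped": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }

old = [{"name": "id", "type": "integer"}, {"name": "email", "type": "string"}]
new = [{"name": "id", "type": "bigint"}, {"name": "signup_date", "type": "string"}]
print(diff_schemas(old, new))  # reports one added, one dropped, one retyped column
```

Tools like schemachange or pyiceberg would then apply the corresponding DDL changes across environments.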


Automation is typically handled through an orchestration pipeline such as Airflow or Dagster, which can run jobs in sequence: ingesting raw data and inferring schema for Bronze, storing or updating the schema YAML, validating and cleaning data for Silver, generating and updating Gold views, and finally notifying stakeholders of any schema changes.
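The sequence above can be sketched as plain functions; in Airflow each would become a task (and in Dagster an op) wired in this order. The step names and log messages are illustrative.

```python
# Each step below would map to an Airflow task or Dagster op; plain
# functions keep the sequencing sketch dependency-free.
def ingest_bronze(dataset):      return f"{dataset}: raw data ingested, schema inferred"
def update_schema_yaml(dataset): return f"{dataset}: schema YAML stored/updated"
def validate_silver(dataset):    return f"{dataset}: data validated and cleaned"
def refresh_gold(dataset):       return f"{dataset}: gold views regenerated"
def notify(dataset):             return f"{dataset}: stakeholders notified of schema changes"

PIPELINE = [ingest_bronze, update_schema_yaml, validate_silver, refresh_gold, notify]

def run_pipeline(dataset):
    """Run the medallion steps in order and return their log lines."""
    return [step(dataset) for step in PIPELINE]

for line in run_pipeline("customers"):
    print(line)
```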


A recommended folder structure for this approach includes three schema folders (bronze, silver, gold) each containing YAML files for datasets, a scripts folder with Python utilities for schema generation and validation, a dbt project folder for model definitions, and a folder for orchestration DAGs such as medallion_schema_dag.py.


Best practices summary:

For Bronze, schemas are inferred automatically using tools like pyarrow or delta-rs and stored as YAML for documentation. For Silver, schemas are defined explicitly in YAML and validated with Great Expectations, while dbt or templated SQL applies the cleaning and transformation. For Gold, schemas are YAML-defined and version-controlled, with SQL logic dynamically generated to build metrics or views for business consumption.


Combining open-source tools such as Delta Lake or Iceberg for data storage and evolution, dbt Core for transformations, Great Expectations for validation, Airflow or Dagster for orchestration, and YAML with Git for schema as code provides a fully open-source, dynamic, and maintainable framework for Medallion Architecture schema management.


If needed, this setup can be extended with a fully open-source schema governance framework by integrating Apache Atlas or Amundsen for lineage tracking and metadata discovery.


The folder template below illustrates how YAML, Python, dbt, and Great Expectations fit together to implement this dynamic Medallion schema management approach.





Folder structure

/schemas
  /bronze
    customers.yaml
    orders.yaml
  /silver
    customers_clean.yaml
    orders_enriched.yaml
  /gold
    customer_metrics.yaml

/scripts
  schema_generator.py
  schema_validator.py

/dbt_project
  models/
    bronze/
    silver/
    gold/

/airflow_dags
  medallion_schema_dag.py

