There are several open-source data quality frameworks that help with data profiling, validation, and monitoring. Here are some of the top options:
1. Great Expectations
• GitHub: https://github.com/great-expectations/great_expectations
• Features: Data validation, profiling, and documentation with expectations and tests.
• Best For: Teams needing data quality pipelines in Python.
• Integration: Works with Pandas, Spark, SQL databases, and cloud storage.
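Great Expectations' own Python API has changed significantly between major versions, so consult its documentation for the exact calls. Purely to illustrate the declarative "expectation" pattern it popularized — named checks validated against a dataset, producing a structured result report — here is a minimal stdlib-only sketch (the function names and report shape are hypothetical, not the Great Expectations API):

```python
# Minimal sketch of the declarative "expectation" pattern (NOT the
# Great Expectations API): each expectation is a named predicate
# evaluated against every row, and the results are collected into a
# report, similar in spirit to GE's validation results.

def expect_column_values_to_not_be_null(rows, column):
    """Return (success, failing_row_indexes) for a null check."""
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return len(failures) == 0, failures

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Return (success, failing_row_indexes) for a range check."""
    failures = [
        i for i, row in enumerate(rows)
        if row.get(column) is None or not (min_value <= row[column] <= max_value)
    ]
    return len(failures) == 0, failures

def validate(rows, expectations):
    """Run each (name, callable) expectation and build a result report."""
    return {
        name: {"success": ok, "failing_rows": bad}
        for name, (ok, bad) in ((name, check(rows)) for name, check in expectations)
    }

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},    # out of range
    {"id": None, "amount": 3.0},  # null id
]

report = validate(rows, [
    ("id_not_null", lambda r: expect_column_values_to_not_be_null(r, "id")),
    ("amount_in_range", lambda r: expect_column_values_to_be_between(r, "amount", 0, 100)),
])
```

In the real library, expectations are attached to data sources (Pandas, Spark, SQL) and the validation results feed generated documentation.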
2. Deequ (by AWS)
• GitHub: https://github.com/awslabs/deequ
• Features: Data profiling, constraints validation, anomaly detection.
• Best For: Large-scale data validation using Apache Spark.
• Integration: Works well in AWS environments and big data platforms.
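Deequ itself runs on the JVM with Apache Spark (a Python wrapper, PyDeequ, also exists), so a faithful example needs a Spark session. As a stdlib-only sketch of the idea behind its anomaly detection — comparing a tracked metric, such as a table's row count, against its history using a relative rate-of-change threshold — here is a toy version (the function is hypothetical, not Deequ's API):

```python
# Sketch of metric-history anomaly detection in the spirit of Deequ's
# rate-of-change strategies: flag the newest metric value if it changed
# by more than the allowed ratio versus the previous recorded value.
# (Hypothetical helper, not the Deequ API.)

def is_anomalous(history, new_value, max_rate_increase=1.5, max_rate_decrease=0.5):
    """Compare new_value to the last recorded metric value.

    Returns True when new_value / last exceeds max_rate_increase or
    falls below max_rate_decrease (e.g. a sudden row-count drop).
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    last = history[-1]
    if last == 0:
        return new_value != 0
    rate = new_value / last
    return rate > max_rate_increase or rate < max_rate_decrease

# Daily row counts of a table, as a metric history
row_counts = [1000, 1040, 990, 1012]
```

For example, `is_anomalous(row_counts, 1030)` is within the normal band, while `is_anomalous(row_counts, 190)` trips the lower threshold, which is exactly the kind of silent data loss this technique is meant to catch.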
3. Monte Carlo (commercial, not open source)
• GitHub: No open-source repo. Note that Anomalo is a separate commercial data observability product, not an open-source part of Monte Carlo; open-source alternatives include Soda (below).
• Features: Automated anomaly detection and data observability.
• Best For: Data engineering teams looking for a managed anomaly detection service for data pipelines.
4. Soda Core (formerly Soda SQL)
• GitHub: https://github.com/sodadata/soda-core (the older soda-sql project at https://github.com/sodadata/soda-sql has been superseded)
• Features: SQL-based data monitoring and validation, with checks written in SodaCL, a YAML-based check language.
• Best For: Teams using SQL-based data warehouses like Snowflake, BigQuery, and Redshift.
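For illustration, a Soda check file in SodaCL might look roughly like the following; the table and column names are made up, and exact syntax varies by Soda version, so treat it as a sketch rather than a copy-paste recipe:

```yaml
# checks.yml — hypothetical table and column names; SodaCL sketch
checks for orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
  - invalid_percent(status) < 1%:
      valid values: [placed, shipped, delivered, cancelled]
```

A scan then runs these checks against the configured warehouse connection and reports pass/fail results per check.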
5. CloudDQ (by Google)
• GitHub: https://github.com/GoogleCloudPlatform/cloud-data-quality
• Features: Declarative, YAML-defined data quality rules that are executed as SQL against Google BigQuery.
• Best For: Google Cloud users needing automated data quality validation.
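CloudDQ configurations separate reusable rules from the bindings that apply them to specific tables and columns. The fragment below is an approximate sketch of that shape — the project, dataset, table, and column names are made up, and field names may differ between CloudDQ versions, so check the repo's reference docs before relying on it:

```yaml
# Approximate CloudDQ-style configuration (field names may vary by version)
rules:
  NOT_NULL:
    rule_type: NOT_NULL

row_filters:
  NONE:
    filter_sql_expr: "TRUE"

rule_bindings:
  ORDER_ID_NOT_NULL:       # hypothetical binding for a sales table
    column_id: order_id
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL
```

At run time, CloudDQ compiles such bindings into BigQuery SQL and writes the validation results to a summary table for monitoring.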
6. OpenMetadata
• GitHub: https://github.com/open-metadata/OpenMetadata
• Features: Data discovery, metadata management, lineage tracking, and quality checks.
• Best For: Enterprises managing metadata and governance.
Would you like recommendations based on a specific tech stack or business use case?