After 23 years, I'll tell you what separates good pipelines from great ones.
It's never the tool.
It's always the discipline.
100 tips to write Clean Data Pipelines👇
1. Avoid NULLs in join keys
2. Sample data > mocks
3. Keep pipelines simple
4. Use DataGrip or dbt Cloud
5. Table names > inline comments
6. Split DAGs, don't monolith
7. Write fast data quality tests
8. Use strong column names
9. Schemas must fit contracts
10. Minimize SQL comments
11. Delete unused pipelines
12. Keep pipeline stages cohesive
13. Test data quality early & often
14. Master your query editor shortcuts
15. Set max SQL line width
16. Remove noise columns
17. Avoid hardcoded thresholds
18. Avoid hardcoded table names
19. Use SQL auto-formatters
20. Avoid monolithic DAGs
21. Commit pipeline changes early & often
22. Working pipeline ≠ clean pipeline
23. Comments explain business logic
24. Prefix boolean columns (is_, has_)
25. Use searchable column names
26. Don't duplicate transformation logic
27. Avoid bloated WHERE clauses
28. One transformation per CTE
29. Use consistent naming across the warehouse
30. No essays inside SQL comments
31. Link pipeline changes to tickets
32. Keep DAG inputs/outputs minimal
33. Avoid hardcoded global configs
34. Capture business logic in dbt models
35. Write repeatable data quality tests
36. Refactor pipelines early & often
37. Produce thorough data tests
38. Delete dead pipeline stages
39. Depend on table contracts, not raw sources
40. Use pronounceable table names
41. Keep proper SQL indentation
42. Write independent data quality checks
43. Don't abbreviate column names
44. Max 1 transformation per CTE
45. Use parameterized pipeline runs
46. Decouple ingestion from transformation
47. No horizontal SQL alignment
48. Use Arrange-Act-Assert in pipeline tests
49. Readable SQL > clever SQL
50. Limit pipeline task parameters
51. Use meaningful sample datasets
52. Readable pipeline > clever pipeline
53. Avoid boolean task flags
54. Hard-to-test pipeline = bad design
55. Use consistent SQL formatting standards
56. No transformation logic inside data tests
57. One responsibility per DAG
58. Write meaningful pipeline commit messages
59. Write deterministic data quality checks
60. Hide irrelevant columns in test fixtures
61. Use domain-based folder structure
62. Document your data contracts
63. Use nouns for table names
64. Review SQL before it hits production
65. Use consistent business terminology
66. Avoid storing everything as VARCHAR
67. Modular models > monolithic queries
68. Avoid pipelines with too many config params
69. Name tests: when_nulls_then_fail
70. One assertion per data test
71. Name tests after what they validate
72. Tests should fail loud, not silently
73. Never name columns is_not_deleted
74. Good boolean names: is_active, has_churned, was_refunded
75. Assert row counts, not just "no error"
76. One output table per pipeline stage
77. Review data models with your team
Continued in comments