After 23 years, I'll tell you what separates good pipelines from great ones.


It's never the tool.

It's always the discipline.


100 tips for writing clean data pipelines 👇


1. Avoid NULLs in join keys
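
A minimal sketch of why this matters, using Python's built-in sqlite3 (the orders/customers tables are made up for illustration): because `NULL = NULL` evaluates to NULL in SQL, rows with NULL join keys silently vanish from an inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (3, 10);
    INSERT INTO customers VALUES (10, 'Ada'), (NULL, 'Ghost');
""")

# Order 2 and customer 'Ghost' both have NULL keys, so neither
# appears in the result -- no error, just missing rows.
rows = sorted(conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""").fetchall())
print(rows)  # [(1, 'Ada'), (3, 'Ada')]
```

The failure mode is silent row loss, which is why it belongs in a data quality test, not just a code review checklist.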

2. Sample data > mocks

3. Keep pipelines simple

4. Use DataGrip or dbt Cloud

5. Table names > inline comments

6. Split DAGs; don't build monoliths

7. Write fast data quality tests

8. Use strong column names

9. Schemas must fit contracts

10. Minimize SQL comments

11. Delete unused pipelines

12. Keep pipeline stages cohesive

13. Test data quality early & often

14. Master your query editor shortcuts

15. Set max SQL line width

16. Remove noise columns

17. Avoid hardcoded thresholds

18. Avoid hardcoded table names
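
Tips 17 and 18 in one hedged sketch: pull thresholds and table names into a single config object instead of scattering literals through the code. The table name, threshold, and check function here are hypothetical examples, not a specific framework's API.

```python
CONFIG = {
    "orders_table": "analytics.fct_orders",
    "max_null_rate": 0.01,  # tolerated share of NULL customer_ids
}

def null_rate_check(null_count: int, total: int, config: dict = CONFIG) -> bool:
    """Fail when the NULL rate exceeds the configured threshold."""
    rate = null_count / total if total else 0.0
    return rate <= config["max_null_rate"]

print(null_rate_check(5, 1000))   # True: 0.5% is under the 1% threshold
print(null_rate_check(50, 1000))  # False: 5% breaches it
```

Changing the threshold (or pointing the check at a staging table) is now a config edit, not a code change.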

19. Use SQL auto-formatters

20. Avoid monolithic DAGs

21. Commit pipeline changes early & often

22. Working pipeline ≠ clean pipeline

23. Comments explain business logic

24. Prefix boolean columns (is_, has_)

25. Use searchable column names

26. Don't duplicate transformation logic

27. Avoid bloated WHERE clauses

28. One transformation per CTE
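
A small sketch of what "one transformation per CTE" looks like in practice, run through sqlite3 so it's self-contained (the payments table is invented): each CTE does exactly one thing, so each step can be selected and debugged on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (payment_id INTEGER, amount REAL, status TEXT);
    INSERT INTO payments VALUES
        (1, 10.0, 'settled'), (2, -3.0, 'settled'), (3, 99.0, 'failed');
""")

total = conn.execute("""
    WITH settled_payments AS (   -- step 1: filter, nothing else
        SELECT amount FROM payments WHERE status = 'settled'
    ),
    settled_total AS (           -- step 2: aggregate, nothing else
        SELECT SUM(amount) AS total_amount FROM settled_payments
    )
    SELECT total_amount FROM settled_total
""").fetchone()[0]
print(total)  # 7.0
```

If the total looks wrong, you can `SELECT * FROM settled_payments` and check the filter in isolation before blaming the aggregate.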

29. Use consistent naming across the warehouse

30. No essays inside SQL comments

31. Link pipeline changes to tickets

32. Keep DAG inputs/outputs minimal

33. Avoid hardcoded global configs

34. Capture business logic in dbt models

35. Write repeatable data quality tests

36. Refactor pipelines early & often

37. Produce thorough data tests

38. Delete dead pipeline stages

39. Depend on table contracts, not raw sources

40. Use pronounceable table names

41. Keep proper SQL indentation

42. Write independent data quality checks

43. Don't abbreviate column names

44. Max 1 transformation per CTE

45. Use parameterized pipeline runs

46. Decouple ingestion from transformation
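
A hypothetical sketch of the split: ingestion lands raw records untouched, transformation reads only from that landed copy, so either side can be rerun or tested independently. Function names and fields are illustrative, not a real framework.

```python
def ingest(source_rows: list[dict]) -> list[dict]:
    """Land raw records as-is -- no business logic here."""
    return [dict(row) for row in source_rows]

def transform(raw_rows: list[dict]) -> list[dict]:
    """Apply business logic only to already-landed data."""
    return [
        {"user_id": r["id"], "is_active": r["status"] == "active"}
        for r in raw_rows
    ]

raw = ingest([{"id": 1, "status": "active"}, {"id": 2, "status": "churned"}])
final = transform(raw)
print(final)  # [{'user_id': 1, 'is_active': True}, {'user_id': 2, 'is_active': False}]
```

When the business logic changes, you replay `transform` over the landed data without hitting the source system again.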

47. No horizontal SQL alignment

48. Use Arrange-Act-Assert in pipeline tests
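
Arrange-Act-Assert applied to a pipeline test, as a minimal sketch (the `dedupe_latest` transformation under test is a made-up example): each phase is visually separate, so a failing test reads like a sentence.

```python
def dedupe_latest(rows):
    """Hypothetical transformation under test: keep the latest row per id."""
    latest = {}
    for row in rows:
        if row["id"] not in latest or row["updated_at"] > latest[row["id"]]["updated_at"]:
            latest[row["id"]] = row
    return list(latest.values())

def test_dedupe_keeps_latest_row():
    # Arrange: two versions of the same record
    rows = [
        {"id": 1, "updated_at": "2024-01-01", "status": "new"},
        {"id": 1, "updated_at": "2024-02-01", "status": "shipped"},
    ]
    # Act: run the transformation once
    result = dedupe_latest(rows)
    # Assert: only the newest version survives
    assert result == [{"id": 1, "updated_at": "2024-02-01", "status": "shipped"}]

test_dedupe_keeps_latest_row()
```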

49. Readable SQL > clever SQL

50. Limit pipeline task parameters

51. Use meaningful sample datasets

52. Readable pipeline > clever pipeline

53. Avoid boolean task flags

54. Hard-to-test pipeline = bad design

55. Use consistent SQL formatting standards

56. No transformation logic inside data tests

57. One responsibility per DAG

58. Write meaningful pipeline commit messages

59. Write deterministic data quality checks
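
One common source of non-determinism is the clock: a freshness check that calls "today" inside the check gives different answers on different days. A minimal sketch of the fix, with an invented `check_freshness` example: make the run date an explicit input, so the same inputs always produce the same result and the check is replayable in tests.

```python
from datetime import date

# Non-deterministic version (avoid): depends on when the check runs.
#   (date.today() - latest_load).days <= 1

# Deterministic version: the run date is an explicit parameter.
def check_freshness(latest_load: date, run_date: date, max_lag_days: int = 1) -> bool:
    return (run_date - latest_load).days <= max_lag_days

print(check_freshness(date(2024, 5, 1), run_date=date(2024, 5, 2)))  # True
print(check_freshness(date(2024, 5, 1), run_date=date(2024, 5, 4)))  # False
```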

60. Hide irrelevant columns in test fixtures

61. Use domain-based folder structure

62. Document your data contracts

63. Use nouns for table names

64. Review SQL before it hits production

65. Use consistent business terminology

66. Avoid storing everything as VARCHAR

67. Modular models > monolithic queries

68. Avoid pipelines with too many config params

69. Name tests: when_nulls_then_fail

70. One assertion per data test

71. Name tests after what they validate

72. Tests should fail loud, not silently

73. Never name columns is_not_deleted

74. Good boolean names: is_active, has_churned, was_refunded

75. Assert row counts, not just "no error"
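
A sketch of the trap, again via sqlite3 with invented table names: a `CREATE TABLE AS SELECT` with a broken filter "succeeds" even when it keeps zero rows, so the only real safety net is asserting the count you expect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_events (event_id INTEGER)")
conn.executemany("INSERT INTO stg_events VALUES (?)", [(i,) for i in range(100)])

# This statement raises no error even if the WHERE clause is wrong
# and filters out every row:
conn.execute("""
    CREATE TABLE fct_events AS
    SELECT event_id FROM stg_events WHERE event_id >= 0
""")

# So don't stop at "no error" -- assert the row count you expect.
source_count = conn.execute("SELECT COUNT(*) FROM stg_events").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM fct_events").fetchone()[0]
assert target_count == source_count, f"expected {source_count} rows, got {target_count}"
print(target_count)  # 100
```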

76. One output table per pipeline stage

77. Review data models with your team


Continued in comments 
