GenAI Knowledge Check: Master Summary

 

The Architecture (Questions 1, 2 & 9)

These questions focus on how a model is built and its physical limitations.

  • 1. Parameters:

    • Answer: Internal weights and settings that define the model's structure and intelligence.

    • Concept: Think of these as the "knobs" the model adjusts during training. More parameters often mean a more capable (but slower) model.

  • 2. Context Window Limit:

    • Answer: The model drops the earliest information to make room for new data, potentially leading to hallucinations.

    • Concept: Like short-term memory. Once it’s full, the "oldest" info is deleted so it can keep talking, which can cause it to lose track of original instructions.

  • 9. High-Volume/Low-Latency Tasks:

    • Answer: Small Language Models (SLMs).

    • Concept: If you need speed and repetition over deep reasoning, a smaller, lighter model is faster and cheaper than a massive "Frontier" model.
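The context-window behavior described in question 2 can be sketched in a few lines. This is a toy illustration, not any particular model's implementation: token counts are approximated by word counts (real systems use a tokenizer), and the message strings are made up for the example.

```python
# Sketch of context-window truncation: when the conversation exceeds the
# window, the oldest messages are dropped first. Token counts are
# approximated by word counts here; real models count tokenizer tokens.

def truncate_to_window(messages, max_tokens):
    """Keep the most recent messages that fit inside max_tokens."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest -> oldest
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                           # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

history = [
    "System: always answer in French",      # the original instruction
    "User: summarize chapter one",
    "Assistant: chapter one introduces the main characters",
    "User: now summarize chapter two",
]
print(truncate_to_window(history, max_tokens=12))
```

Note what survives: the system instruction is the oldest message, so it is the first thing dropped, which is exactly how a model "loses track of original instructions" mid-conversation.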


Enterprise Strategy (Questions 3, 4 & 8)

These focus on how businesses actually use AI to gain an advantage.

  • 3. The Competitive Moat:

    • Answer: Connecting GenAI to unique, proprietary data and domain expertise.

    • Concept: Everyone has the model; not everyone has your company's private data. That's the secret sauce.

  • 4. RAG (Retrieval-Augmented Generation):

    • Answer: It allows the model to look up real-time information from external trusted sources before generating an answer.

    • Concept: The "Open Book" method. It searches your files first, then answers based on what it found.

  • 8. Grounding:

    • Answer: It anchors the model's responses in specific, verified organizational data to reduce hallucinations.

    • Concept: Ensuring the AI "stays in its lane" by forcing it to use specific, verified facts rather than guessing.
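The "Open Book" flow in questions 4 and 8 (retrieve first, then generate from what was found) can be sketched as below. This is a minimal illustration under stated assumptions: real RAG systems rank documents by embedding similarity over a vector store, so simple word overlap stands in here, and the policy documents are invented for the example.

```python
# Minimal retrieve-then-generate sketch. Word overlap stands in for the
# embedding similarity a real vector store would use.

DOCUMENTS = [
    "Refund policy: customers may request a refund within 30 days.",
    "Shipping policy: orders ship within 2 business days.",
    "Support hours: weekdays 9am to 5pm.",
]

def _words(text):
    """Lowercased words with trailing punctuation stripped (toy tokenizer)."""
    return {w.strip(".:?,") for w in text.lower().split()}

def retrieve(question, docs, k=1):
    """Rank documents by word overlap with the question, return the top k."""
    return sorted(docs, key=lambda d: len(_words(question) & _words(d)),
                  reverse=True)[:k]

def build_prompt(question, docs):
    """Ground the answer: instruct the model to use only retrieved context."""
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the refund policy?", DOCUMENTS))
```

The grounding from question 8 lives in the prompt itself: "using only this context" is what forces the model to answer from verified facts rather than guess.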


Agents & Reasoning (Questions 5, 7 & 10)

These look at how AI moves from "chatting" to "doing."

  • 5. GenAI vs. AI Agents:

    • Answer: GenAI is for single-step generation, while agents use reasoning for multi-step, adaptive workflows.

    • Concept: GenAI is a calculator; an Agent is a mathematician who knows which buttons to press to solve a long word problem.

  • 7. The Intelligent Router:

    • Answer: Supervisor Agent Brick.

    • Concept: The "Manager." It listens to your request and decides which "specialist" (sub-agent) is the right one to fix it.

  • 10. The "Brilliant Intern" Analogy:

    • Answer: Highly knowledgeable but takes instructions extremely literally and lacks specific business context.

    • Concept: You have to be specific. It’s smart, but it doesn't know your company's "unspoken" rules yet.
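The "Manager" routing in question 7 can be sketched as a lookup from request to specialist. The keyword table is a stand-in assumption: a real supervisor agent typically asks an LLM to classify the request, and the specialist names here are invented for the example.

```python
# Sketch of a supervisor routing a request to the best-matching
# sub-agent. Keyword matching stands in for the LLM-based
# classification a real supervisor agent would do.

SPECIALISTS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "tech_support": {"error", "crash", "bug", "login"},
    "sales": {"pricing", "demo", "quote", "upgrade"},
}

def route(request):
    """Return the specialist whose keywords overlap the request most."""
    req_words = set(request.lower().split())
    return max(SPECIALISTS, key=lambda name: len(req_words & SPECIALISTS[name]))

print(route("I was charged twice, I need a refund"))   # -> billing
```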


Evaluation & Bias (Question 6)

How we measure if the AI is actually doing a good job.

  • 6. LLM-as-a-Judge (The "Con"):

    • Answer: It may exhibit "verbosity bias," favoring longer responses regardless of accuracy.

    • Concept: AI judges often fall for "fluff." They might give a higher grade to a long, poetic answer than a short, 100% correct one.
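Verbosity bias is easy to reproduce with a deliberately naive judge. This toy scores purely by length, which is the failure mode (not a real evaluation method); the answers are made up, and real LLM judges exhibit a softer version of the same tendency, which is why judging rubrics often tell the judge to ignore length.

```python
# Toy illustration of verbosity bias: a judge that rewards length ranks
# a long, padded, WRONG answer above a short, correct one.

def naive_judge(answer):
    """Score purely by word count -- the biased behavior to avoid."""
    return len(answer.split())

short_correct = "Paris."
long_fluffy = ("That is a wonderful question with a rich history; after "
               "careful consideration, many would say Lyon.")

# The padded, wrong answer wins under the biased judge.
print(naive_judge(long_fluffy) > naive_judge(short_correct))  # True
```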


Quick Reference Comparison

| Feature | Standard GenAI | AI Agent |
| --- | --- | --- |
| Workflow | Single-turn (Input → Output) | Multi-step (Plan → Tool → Result) |
| Memory | Context window | Context + long-term "memory" storage |
| Data Access | Training data (static) | RAG / Grounding (real-time) |
| Logic | Pattern recognition | Iterative reasoning |

After 23 years, I'll tell you what separates good pipelines from great ones.


It's never the tool.

It's always the discipline.


100 tips to write Clean Data Pipelines👇


1. Avoid NULLs in join keys

2. Sample data > mocks

3. Keep pipelines simple

4. Use DataGrip or dbt Cloud

5. Table names > inline comments

6. Split DAGs, don't monolith

7. Write fast data quality tests

8. Use strong column names

9. Schemas must fit contracts

10. Minimize SQL comments

11. Delete unused pipelines

12. Keep pipeline stages cohesive

13. Test data quality early & often

14. Master your query editor shortcuts

15. Set max SQL line width

16. Remove noise columns

17. Avoid hardcoded thresholds

18. Avoid hardcoded table names

19. Use SQL auto-formatters

20. Avoid monolithic DAGs

21. Commit pipeline changes early & often

22. Working pipeline ≠ clean pipeline

23. Comments explain business logic

24. Prefix boolean columns (is_, has_)

25. Use searchable column names

26. Don't duplicate transformation logic

27. Avoid bloated WHERE clauses

28. One transformation per CTE

29. Use consistent naming across the warehouse

30. No essays inside SQL comments

31. Link pipeline changes to tickets

32. Keep DAG inputs/outputs minimal

33. Avoid hardcoded global configs

34. Capture business logic in dbt models

35. Write repeatable data quality tests

36. Refactor pipelines early & often

37. Produce thorough data tests

38. Delete dead pipeline stages

39. Depend on table contracts, not raw sources

40. Use pronounceable table names

41. Keep proper SQL indentation

42. Write independent data quality checks

43. Don't abbreviate column names

44. Max 1 transformation per CTE

45. Use parameterized pipeline runs

46. Decouple ingestion from transformation

47. No horizontal SQL alignment

48. Use Arrange-Act-Assert in pipeline tests

49. Readable SQL > clever SQL

50. Limit pipeline task parameters

51. Use meaningful sample datasets

52. Readable pipeline > clever pipeline

53. Avoid boolean task flags

54. Hard-to-test pipeline = bad design

55. Use consistent SQL formatting standards

56. No transformation logic inside data tests

57. One responsibility per DAG

58. Write meaningful pipeline commit messages

59. Write deterministic data quality checks

60. Hide irrelevant columns in test fixtures

61. Use domain-based folder structure

62. Document your data contracts

63. Use nouns for table names

64. Review SQL before it hits production

65. Use consistent business terminology

66. Avoid storing everything as VARCHAR

67. Modular models > monolithic queries

68. Avoid pipelines with too many config params

69. Name tests: when_nulls_then_fail

70. One assertion per data test

71. Name tests after what they validate

72. Tests should fail loud, not silently

73. Never name columns is_not_deleted

74. is_active, has_churned, was_refunded

75. Assert row counts, not just "no error"

76. One output table per pipeline stage

77. Review data models with your team
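Several of the testing tips above (1, 48, 59, 69–72) combine into one small sketch. In-memory dicts stand in for warehouse tables, and the function and column names are invented for the example; in practice this kind of check runs against real tables (e.g. as a dbt test).

```python
# Sketch of a data quality check applying the tips above: a descriptive
# when/then test name (69, 71), one assertion (70), a deterministic
# fixture (59), and a loud failure (72).

def check_no_nulls_in_join_key(rows, key):
    """Fail loudly if any row has a NULL join key (tip 1)."""
    bad = [i for i, row in enumerate(rows) if row.get(key) is None]
    if bad:
        raise ValueError(f"NULL {key} in rows {bad}")
    return True

def test_when_nulls_in_join_key_then_fail():
    # Arrange: a tiny fixture with one bad row (tip 48)
    rows = [{"customer_id": 1}, {"customer_id": None}]
    # Act / Assert: exactly one expectation (tip 70)
    try:
        check_no_nulls_in_join_key(rows, "customer_id")
    except ValueError:
        return  # failed loudly, as expected
    raise AssertionError("check should have failed on NULL join key")

test_when_nulls_in_join_key_then_fail()
```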


Continued in comments 
