Knowledge base from "AI Engineering: Building Applications with Foundation Models" by Chip Huyen. Use when applying Huyen's frameworks for foundation model adaptation, evaluation methodology, RAG architecture, prompt engineering, finetuning decisions, inference optimization, dataset engineering, and production AI system design.
Slash command
/book-skills:huyen-ai-engineering [topic, framework name, or chapter number, e.g. 'RAG', 'LoRA', 'evaluation', 'ch06']
**Author**: Chip Huyen | **Pages**: ~450 | **Chapters**: 10 | **Generated**: 2026-06-02
- chapters/ch01-intro-building-ai-apps.md
- chapters/ch02-understanding-foundation-models.md
- chapters/ch03-evaluation-methodology.md
- chapters/ch04-evaluate-ai-systems.md
- chapters/ch05-prompt-engineering.md
- chapters/ch06-rag-and-agents.md
- chapters/ch07-finetuning.md
- chapters/ch08-dataset-engineering.md
- chapters/ch09-inference-optimization.md
- chapters/ch10-ai-engineering-architecture.md
- cheatsheet.md
- glossary.md
- patterns.md
- Topic (RAG, finetuning, sampling, evaluation, agents, etc.): I find the relevant chapter.
- Chapter reference (ch06 or ch09): I load that chapter file.

For details beyond the core frameworks below, I will Read the relevant chapters/chNN-*.md, glossary.md, patterns.md, or cheatsheet.md files in this skill.
Try in this order; stop when performance is sufficient:
Core heuristic: "RAG is for facts, finetuning is for form." Finetuning cannot reliably add new knowledge; it changes how the model behaves. RAG adds knowledge; it does not change behavior.
Define evaluation criteria before building. Four criteria buckets:
Evaluation method picker:
Anti-pattern: Eyeballing results. A model that can't be measured can't be improved.
Sampling is the most underrated lever. Before changing models or finetuning, tune sampling parameters:
Temperature divides the logits before softmax. Because dividing by zero is undefined, T=0 is treated as a special case: always pick the max logit (greedy decoding).
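The effect of temperature can be sketched in a few lines of standard-library Python. This is a minimal illustration, not code from the book: the function name `softmax_with_temperature` and the example logits are assumptions chosen for the demo.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature == 0:
        # Greedy special case: put all probability mass on the max logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher temperatures flatten the distribution.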
RAG = feature engineering for foundation models. Key components:
Start simple:
Score(D) = Σ_i 1/(k + rank_i(D))
Retrieval optimization order: chunking strategy → reranking → query rewriting → contextual retrieval
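The reciprocal rank fusion score, Score(D) = Σ 1/(k + rank_i(D)), can be sketched as follows. The function name, the document IDs, and the default `k=60` are assumptions for illustration (60 is a commonly used constant, not a value stated here).

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each ranking lists document IDs from best to worst; ranks start at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from a keyword retriever and a vector retriever.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents that appear near the top of several lists accumulate the highest scores, which is why `doc_b` (ranks 2 and 1) edges out `doc_a` (ranks 1 and 3) in this example.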
Long context vs. RAG: Long context doesn't eliminate RAG. RAG filters signal; raw context adds noise. Start RAG at ~200K tokens even for Claude models.
Inference memory = N × bytes/param × 1.2
Full finetuning memory ≈ 4× inference memory (gradients + Adam states)
LoRA: trainable params = 2 × n × r per layer (r = 16–64 typical)
FP16 (2 bytes/param): 7B model = 14 GB weights; full finetuning ≈ 58 GB; LoRA ≈ 16 GB
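The memory rules of thumb above translate directly into arithmetic. The sketch below is illustrative only: the function names are made up, and the `num_adapted_matrices` parameter and the 4096-wide example are assumptions rather than figures from the book. The 4× rule is coarse, so its output will not exactly match the quoted full-finetuning figure.

```python
def inference_memory_gb(num_params, bytes_per_param, overhead=1.2):
    """Inference memory = N x bytes/param x 1.2, expressed in GB (1e9 bytes)."""
    return num_params * bytes_per_param * overhead / 1e9

def full_finetuning_memory_gb(num_params, bytes_per_param):
    """Rough estimate: about 4x inference memory (gradients + Adam states)."""
    return 4 * inference_memory_gb(num_params, bytes_per_param)

def lora_trainable_params(n, r, num_adapted_matrices=1):
    """Trainable params = 2 x n x r per adapted n-by-n weight matrix."""
    return 2 * n * r * num_adapted_matrices

params_7b = 7e9
print("weights (GB):", params_7b * 2 / 1e9)
print("inference (GB):", round(inference_memory_gb(params_7b, 2), 1))
print("full finetuning, 4x rule (GB):", round(full_finetuning_memory_gb(params_7b, 2), 1))
# Hypothetical setup: 32 adapted matrices of width 4096 at rank 16.
print("LoRA trainable params:", lora_trainable_params(4096, 16, 32))
```

Swapping `bytes_per_param` (for example 1 for an 8-bit format) shows how quantization shrinks the weight footprint.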
LoRA formula: W' = W + (α/r) × A@B
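A tiny worked instance of W' = W + (α/r) × A@B, written with plain Python lists so it runs without dependencies. The helper names `matmul` and `lora_merge` and the 2×2 toy matrices are assumptions for illustration; a real implementation would use a tensor library.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W, A, B, alpha):
    """Return W' = W + (alpha / r) * (A @ B), where A is n x r and B is r x n."""
    r = len(B)  # rank = number of rows of B (and columns of A)
    scale = alpha / r
    delta = matmul(A, B)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Toy example with n = 2 and r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
A = [[1.0], [2.0]]            # n x r adapter factor
B = [[0.5, 0.0]]              # r x n adapter factor
merged = lora_merge(W, A, B, alpha=2)
print(merged)
```

Only A and B are trained; W stays frozen, and the scaled product can be merged into W once training is done.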
Two bottlenecks:
Decode speed = main bottleneck. Quantization (fewer bytes/param) directly reduces bandwidth usage.
Key techniques:
MBU formula: (params × bytes × tokens/s) / theoretical_bandwidth — measures bandwidth efficiency
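The MBU formula can be applied as a quick calculation. In this sketch the function name and the hardware figures (100 tokens/s, 2 TB/s peak bandwidth) are hypothetical values chosen only to make the arithmetic concrete.

```python
def model_bandwidth_utilization(num_params, bytes_per_param,
                                tokens_per_second, peak_bandwidth_bytes_per_s):
    """MBU = (params x bytes/param x tokens/s) / theoretical bandwidth."""
    achieved_bandwidth = num_params * bytes_per_param * tokens_per_second
    return achieved_bandwidth / peak_bandwidth_bytes_per_s

# Hypothetical: a 7B-parameter FP16 model decoding 100 tokens/s
# on an accelerator with 2 TB/s of theoretical memory bandwidth.
mbu = model_bandwidth_utilization(7e9, 2, 100, 2e12)
print(f"MBU: {mbu:.0%}")
```

Under these assumed numbers, halving bytes/param through quantization at the same tokens/s would halve the bandwidth consumed per token, leaving headroom for higher throughput.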
Start minimal; add layers as specific problems emerge:
User feedback is a competitive moat. Design explicit (thumbs, corrections) + implicit (session length, stop generation) feedback collection. Build the data flywheel: usage → feedback → annotation → improved model.
Pre-training (self-supervised, 98% of compute) → SFT (demonstration data, behavior alignment) → Preference finetuning (RLHF or DPO, human preference alignment)
DPO (used by Llama 3) is simpler than RLHF; RLHF (used by GPT-3.5 and Llama 2) is more flexible.
| # | Title | Key Frameworks |
|---|---|---|
| ch01 | Introduction to Building AI Applications | 3-layer AI stack, adaptation categories, AI vs. ML engineering |
| ch02 | Understanding Foundation Models | Chinchilla scaling law, sampling strategies, post-training pipeline, memory math |
| ch03 | Evaluation Methodology | AI as a judge, functional correctness, perplexity/cross-entropy, embeddings |
| ch04 | Evaluate AI Systems | Evaluation-driven development, 4 criteria buckets, model selection, private benchmarks |
| ch05 | Prompt Engineering | CoT, in-context learning, prompt decomposition, defensive prompting, chat templates |
| ch06 | RAG and Agents | RAG architecture, hybrid search, RRF, chunking, agents, tool inventory, planning |
| ch07 | Finetuning | LoRA, PEFT, memory bottlenecks, RAG vs. finetuning decision, numerical formats |
| ch08 | Dataset Engineering | 3 golden goals, data quality 6 criteria, LIMA principle, data flywheel |
| ch09 | Inference Optimization | Prefill/decode split, KV cache, quantization, continuous batching, MFU/MBU |
| ch10 | AI Engineering Architecture & User Feedback | 5-step architecture, guardrails, model gateway, caching, observability, user feedback |
This skill covers the book content only (First Edition, December 2024). It does not include:
The author's core frameworks — adaptation hierarchy, evaluation-driven development, RAG vs. finetuning — are technology-agnostic and should remain relevant beyond specific tools. For hands-on implementation, combine this skill with current documentation and project-specific tools.
npx claudepluginhub andersonfpcorrea/andersonfpcorrea-skills --plugin book-skills