Use for AI and agent engineering work: system prompt design, tool call architecture, context engineering, memory and learning systems, multi-agent coordination, evals and regression gates, fine-tuning pipelines, RAG, vector stores (TurboPuffer/Pinecone/Azure), agent governance and safety, run steering, skill packages, prompt engineering patterns, constrained generation, ML pipelines, data engineering, and production AI infrastructure.
Slash command: /engineering:ai-engineering
Production-grade guidance for building AI agents, LLM systems, and data infrastructure. This skill covers the full AI engineering stack: from system prompt architecture to eval gates, from memory systems to fine-tuning pipelines, from RAG to multi-agent coordination.
README.md
examples/agent-system-prompt-example.md
examples/eval-contract-example.py
examples/multi-agent-workflow-example.py
references/agent-debug-playbooks.md
references/agent-governance-safety.md
references/agentic-system-design.md
references/ai-observability.md
references/computer_vision_architectures.md
references/context-engineering.md
references/data-modeling-patterns.md
references/data-pipeline-architecture.md
references/data-troubleshooting.md
references/data-workflows.md
references/dataops-best-practices.md
references/evals-system.md
references/feature-engineering-ml.md
references/fine-tuning.md
references/guidance-constrained-generation.md
references/llm-evaluation-frameworks.md
Use templates/ for copy-paste starting points; system prompts and eval contracts have templates.
scripts/ handle common operations; do not write boilerplate from scratch.
Use the always_ask confirmation policy for writes.
hard_fail blocks CI; soft_fail warns. Use hard_fail for groundedness, safety, and policy violations.
Read system-prompt-architecture.md when designing or modifying an agent's system prompt: structuring the 8-block anatomy, writing capability allowlists, defining guardrail hierarchies, designing runtime memory injection blocks, applying System Prompt Learning (SPL), implementing Instruction Forwarding between agents, or reviewing anti-patterns (prose-only instructions, duplicate constraints, hard-coded user rules).
Read tool-call-design.md when defining, auditing, or debugging a tool — applying the Intern Test to tool descriptions, assigning ToolCategory (read_objects, list_objects, update_confirmation, agent_as_a_tool, hosted_mcp_tool, etc.), setting confirmation policies (none / always_ask / always_auto / configurable), assigning persistence policies (ephemeral / session / short_term), configuring stream_policy and risk levels, or debugging wrong-tool-selection failures.
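The confirmation policies listed above can be pictured as a small gate in front of each tool call. The sketch below is an illustration only: `ToolSpec`, `ConfirmationPolicy`, `needs_confirmation`, and the two example tool names are hypothetical, and the rule that a configurable tool defers to a deployment-level switch is an assumption rather than something tool-call-design.md is known to specify.

```python
from dataclasses import dataclass
from enum import Enum


class ConfirmationPolicy(Enum):
    # The four policy values named in the text above.
    NONE = "none"
    ALWAYS_ASK = "always_ask"
    ALWAYS_AUTO = "always_auto"
    CONFIGURABLE = "configurable"


@dataclass(frozen=True)
class ToolSpec:
    # Hypothetical tool record; the field names are illustrative.
    name: str
    category: str
    confirmation: ConfirmationPolicy


def needs_confirmation(tool: ToolSpec, deployment_asks: bool = True) -> bool:
    """Return True when a human should approve the call before it runs."""
    if tool.confirmation is ConfirmationPolicy.ALWAYS_ASK:
        return True
    if tool.confirmation is ConfirmationPolicy.CONFIGURABLE:
        # Assumption: configurable tools defer to a deployment-level switch.
        return deployment_asks
    # NONE and ALWAYS_AUTO both run without asking in this sketch.
    return False


# Hypothetical tools: a read that runs freely and a write that always asks.
read_tool = ToolSpec("get_employee", "read_objects", ConfirmationPolicy.NONE)
write_tool = ToolSpec("update_employee", "update_confirmation", ConfirmationPolicy.ALWAYS_ASK)
```

Keeping the policy on the tool record, rather than in prompt prose, means the check can be enforced in code before the call is dispatched.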
Read memory-and-learning-system.md when implementing or debugging memory — designing the memory taxonomy (working_context / run_memory / cross_run_memory), assigning persistence policies to tools, implementing session compaction strategy selection by context size, configuring the reflection and self-improvement loop (signal detection → draft → safety scan → eval gate → promotion), or setting up the fine-tuning pipeline (eval score tiers → dataset curation → job submission).
Read evals-system.md when designing, running, or debugging evals — defining EvalItemContract and EvalSetContract, setting EvalThreshold with hard_fail_below, understanding the three-level gate cascade (evaluator → eval_set → CI), choosing evaluation mode (grader / labeler / hybrid) and implementation type, managing datasets (golden JSONL, synthetic inputs, gym contexts), integrating eval gates into CI, or applying the "when to add an eval" rules.
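One way to read the three-level gate cascade (evaluator, then eval_set, then CI) is sketched below. `Threshold` is a hypothetical stand-in for `EvalThreshold`: only `hard_fail_below` is named in the text, so the `pass_at_or_above` field and the aggregation rule (any hard-failing item fails the set, otherwise the mean is judged) are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Threshold:
    # Hypothetical stand-in for EvalThreshold; pass_at_or_above is an assumed field.
    pass_at_or_above: float
    hard_fail_below: float


def evaluator_verdict(score: float, t: Threshold) -> str:
    """Level 1: one evaluator score becomes pass, soft_fail, or hard_fail."""
    if score < t.hard_fail_below:
        return "hard_fail"
    if score < t.pass_at_or_above:
        return "soft_fail"
    return "pass"


def eval_set_verdict(scores: list[float], t: Threshold) -> str:
    """Level 2 (assumed rule): any hard-failing item fails the set, else judge the mean."""
    if any(evaluator_verdict(s, t) == "hard_fail" for s in scores):
        return "hard_fail"
    return evaluator_verdict(mean(scores), t)


def ci_exit_code(set_verdicts: list[str]) -> int:
    """Level 3: only hard_fail blocks CI; soft_fail is left as a warning."""
    return 1 if "hard_fail" in set_verdicts else 0
```

Under these assumptions a single badly failing item blocks the build even when the average looks healthy, which matches the advice to reserve hard_fail for groundedness, safety, and policy violations.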
Read context-engineering.md when designing context assembly or debugging context failures — mapping the context type taxonomy (working_context, run_memory, cross_run_memory, input_contexts, CLO_documents), implementing tagger architecture for intent-to-context routing, setting context budgets (top_k, top_s, per-type limits), applying compaction strategies by context size, using multi-query patterns (step-back, parallel, sequential), assembling RAG contexts with freshness and diversity, or debugging wrong-context failures (distractor injection, constraint drift, recency bias).
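A context budget with an overall `top_k` and per-type limits could look roughly like the following. This is a minimal sketch: `Candidate` and `apply_budget` are hypothetical names, the greedy highest-score-first selection is an assumption, and `top_s` is left out because the text does not define it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical retrieved item: a context type, a relevance score, and its text.
    context_type: str
    score: float
    text: str


def apply_budget(candidates, top_k, per_type_limit):
    """Keep the highest-scoring candidates, capped overall and per context type."""
    kept = []
    used = defaultdict(int)
    for c in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(kept) == top_k:
            break
        # Types without an explicit limit are bounded only by top_k.
        limit = per_type_limit.get(c.context_type, top_k)
        if used[c.context_type] < limit:
            kept.append(c)
            used[c.context_type] += 1
    return kept


pool = [
    Candidate("run_memory", 0.9, "a"),
    Candidate("run_memory", 0.8, "b"),
    Candidate("run_memory", 0.7, "c"),
    Candidate("cross_run_memory", 0.6, "d"),
    Candidate("working_context", 0.5, "e"),
]
selected = apply_budget(pool, top_k=3, per_type_limit={"run_memory": 2})
```

Here the third `run_memory` item is skipped even though it outscores the `cross_run_memory` item, which is the point of a per-type cap: one context type cannot crowd out the others.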
Read mcp-engineering.md when integrating MCP servers — choosing between MCPServerStdio (local process), MCPServerStreamableHttp (remote HTTP), and HostedMCPTool (OpenAI-managed); managing connection lifecycle and tool caching (MCPCacheManager, 5-min TTL); applying 401-driven OAuth flow; implementing DNS rebinding protection for HTTP servers; filtering tool exposure with create_static_tool_filter; attaching approval handlers for sensitive operations; or debugging MCP failures.
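The tool caching with a 5-minute TTL mentioned above can be illustrated with a small time-based cache. This is not `MCPCacheManager`, whose interface is not shown here; `TTLToolCache`, its methods, and the injectable clock are hypothetical and exist only to show the expiry logic.

```python
import time
from typing import Callable


class TTLToolCache:
    """Hypothetical cache mapping a server name to (expiry time, tool list)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get(self, server: str, fetch: Callable[[], list[str]]) -> list[str]:
        """Return cached tools, calling fetch() only when the entry is missing or expired."""
        now = self._clock()
        entry = self._entries.get(server)
        if entry is not None and now < entry[0]:
            return entry[1]
        tools = fetch()
        self._entries[server] = (now + self._ttl, tools)
        return tools

    def invalidate(self, server: str) -> None:
        """Drop an entry, for example after a reconnect."""
        self._entries.pop(server, None)
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten an entry's lifetime.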
Read ai-observability.md when instrumenting or debugging AI systems — understanding the trace schema (version, agent, progress_updates, reasoning_chunks, context_summaries, usage), building the correlation ID chain (thread_id, run_id, trace_id, conversation_id, user_id, company_id), choosing AI-specific metrics (TTFT, token trends, tool success/failure rates, eval score trends), configuring alerts (score drop > 5 pts = critical, tool error > 10% = high), or diagnosing failures with trace + AICall + SystemLog correlation.
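The two alert rules quoted above (score drop > 5 pts = critical, tool error > 10% = high) reduce to a pair of threshold checks. The function names below are hypothetical, and the sketch assumes eval scores are on a points scale where a difference of 5 is meaningful and that both comparisons are strict.

```python
from typing import Optional


def eval_score_alert(baseline: float, current: float) -> Optional[str]:
    """Return "critical" when the eval score fell by more than 5 points."""
    return "critical" if baseline - current > 5 else None


def tool_error_alert(failures: int, calls: int) -> Optional[str]:
    """Return "high" when more than 10% of tool calls failed."""
    if calls == 0:
        # No traffic in the window: nothing to alert on in this sketch.
        return None
    return "high" if failures / calls > 0.10 else None
```

In practice the baseline would come from a trailing window of eval runs rather than a single prior value; that choice is outside what the text specifies.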
Read agent-debug-playbooks.md when diagnosing a production agent failure — step-by-step playbooks for: wrong tool selection (7 diagnostic steps from logging to eval case), RAG hallucination despite context (retrieval vs. generation failure diagnosis), latency spike or first-token regression (5 component-level diagnostics), and the agent change checklist (pre-change baseline, required artifacts, ship criteria).
Read fine-tuning.md when setting up or managing a fine-tuning pipeline — determining whether to fine-tune vs. use SPL, applying the 4-tier dataset quality pipeline (eval score 0–20/20–40/40–60/60–100), building datasets from AICalls or golden examples, launching SFT jobs with method-based payloads, setting hyperparameters, applying post-training eval gates, or integrating with the outcome scoring system.
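The four eval-score ranges above (0–20/20–40/40–60/60–100) can be mapped to tiers with a boundary lookup. The tier numbering 1 to 4 and the choice to place a boundary score in the higher tier are assumptions, since the ranges as written overlap at 20, 40, and 60; `quality_tier` is a hypothetical helper.

```python
from bisect import bisect_right

# Upper boundaries between the four score ranges named in the text.
TIER_BOUNDS = [20, 40, 60]


def quality_tier(eval_score: float) -> int:
    """Map an eval score in [0, 100] to a tier from 1 (lowest) to 4 (highest)."""
    if not 0 <= eval_score <= 100:
        raise ValueError(f"eval score out of range: {eval_score}")
    # Assumption: a score exactly on a boundary belongs to the higher tier.
    return bisect_right(TIER_BOUNDS, eval_score) + 1


def bucket_by_tier(scores):
    """Group scores into a dict keyed by tier number."""
    buckets = {1: [], 2: [], 3: [], 4: []}
    for s in scores:
        buckets[quality_tier(s)].append(s)
    return buckets
```

What each tier is used for (discard, review, include) is defined by fine-tuning.md and is not reproduced here.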
Read prompt-engineering-patterns.md when writing or improving prompts — the full taxonomy covers zero-shot (including S2A, SimToM, RaR, RE2, Self-Ask), few-shot (KNN, Vote-K, prompt mining), thought generation (CoT, step-back, analogical, Auto-CoT, uncertainty-routed), ensembling (self-consistency, meta-CoT, prompt paraphrasing), self-criticism (verification, calibration, self-refinement, reverse CoT), and decomposition (least-to-most, plan-and-solve, program-of-thought, skeleton, recursive thought).
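Of the ensembling patterns listed, self-consistency is the simplest to show in code: sample several answers and keep the majority. In this sketch `sample` is a hypothetical callable standing in for one model call that returns a final answer string; the normalization (strip and lowercase) is an assumption that suits short answers only.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> tuple[str, float]:
    """Draw n answers and return the most common one with its vote share."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Light normalization so "42" and "42 " count as the same vote.
    answers = [sample().strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Usage with canned outputs in place of a real model call.
canned = iter(["42", "41", "42 ", "42", "7"])
answer, share = self_consistent_answer(lambda: next(canned), n=5)
```

The vote share doubles as a rough confidence signal: a low share suggests routing the question to a stronger method or a human.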
Read agentic-system-design.md when designing multi-agent architectures — ReAct and Plan-Execute loops, function calling with ToolRegistry, multi-agent coordination (Manager-as-Tools vs Handoff vs Parallel Fan-Out vs Hierarchical), MCP/A2A/ACP protocol selection, agent lifecycle (request → Celery → orchestrator → streaming → ResultBus), guardrail types (Input/Output/session/confirmation), autonomy levels, coordination services (lease, anchor, conflict), RunSteeringService (mid-run correction injection), and AgentSkillPackageSpec (distributable skill bundles).
Read rag-and-vector-stores.md when implementing or debugging RAG — choosing a vector store provider (aws/azure/pinecone/turbopuffer/openai), using the logical index taxonomy (objects/lessons/knowledge/canvas/files), writing cross-provider filters with normalize_vector_filter(), applying with_active_vector_filter() for active-document defaults, upsert and sync patterns, TurboPuffer-specific constraints (namespace naming, metadata types, tuple DSL, score conversion), or diagnosing retrieval failures.
Read agent-governance-safety.md when expanding agent autonomy or adding side-effecting tools — mapping the risk taxonomy (malicious use, accidents, security, systemic), applying governance levers (alignment, control, visibility, robustness), implementing constitutional safety principles (human rights, privacy, harmlessness, anti-impersonation), configuring production safety gates (action permissions, memory boundaries, eval gates, incident response), enforcing company/user/thread scope isolation, defending against prompt injection, and running the agent capability checklist before any autonomy expansion.
Read llm-evaluation-frameworks.md when measuring or comparing LLM output quality — choosing between BLEU, ROUGE, BERTScore, or semantic metrics; implementing human evaluation (Likert, side-by-side, LLM-as-judge); designing A/B tests for prompts; using benchmark datasets (MMLU, HumanEval, GSM8K); evaluating RAG context relevance and answer faithfulness; or designing automated evaluation pipelines.
Read guidance-constrained-generation.md when prompt-only formatting is too weak — implementing regex-constrained fields, finite-choice routing with select, grammar-backed JSON/code generation, Guidance functions with @guidance, token-healing-aware output control, backend setup for OpenAI/Anthropic/local models, or multi-step constrained workflows.
Read data-pipeline-architecture.md when designing or implementing data pipeline architecture — choosing between Lambda/Kappa/Medallion patterns, configuring Spark batch jobs, setting up Kafka topics and consumer groups, implementing exactly-once semantics, handling late-arriving data with watermarks, using CDC or bulk ingestion patterns, or implementing Dead Letter Queue and Circuit Breaker error handling.
Read data-modeling-patterns.md when designing data models — choosing between Star/Snowflake schema, One Big Table, or Data Vault; implementing Slowly Changing Dimensions; writing dbt models with incremental materialization; or managing schema evolution and data contracts.
Read dataops-best-practices.md for production data operations — setting up Great Expectations or dbt test suites, authoring data contracts with SLAs, wiring GitHub Actions CI/CD for data pipelines, emitting OpenLineage events, configuring Prometheus alerting, or running incident response runbooks.
Read statistical-modeling-experiments.md when designing or analyzing controlled experiments — calculating sample size, running two-proportion z-tests, applying multiple-testing corrections, detecting sample ratio mismatch, or estimating causal effects.
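The two-proportion z-test and sample-size calculation named above can be sketched with the standard library alone. This uses the textbook pooled-variance z-test and the usual normal-approximation sample-size formula; the function names are hypothetical and the defaults (two-sided alpha of 0.05, power of 0.8) are conventional choices, not values taken from statistical-modeling-experiments.md.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p2 - p1) / se
    return z, 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def sample_size_per_arm(p_base: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per arm needed to detect an absolute lift of mde over p_base."""
    p_new = p_base + mde
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)
```

Running the sample-size calculation before the experiment, and the z-test only once that sample is collected, avoids the inflated false-positive rate that comes from peeking.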
Read feature-engineering-ml.md when building feature pipelines — constructing Scikit-learn ColumnTransformer pipelines, choosing encoding strategies, extracting time features, preventing leakage, or selecting features with SHAP.
Read model-evaluation.md when training and evaluating classification or regression models — cross-validating with StratifiedKFold, choosing AUC-ROC vs AUC-PR, logging to MLflow, explaining with SHAP, or calibrating probabilities.
Read computer_vision_architectures.md for CV model architecture selection. Read object_detection_optimization.md for training and optimization. Read production_vision_systems.md for ONNX/TensorRT export, edge deployment, Triton serving, and video pipelines.
# Analyze and optimize a prompt (token count, clarity score, suggestions)
python scripts/prompt_optimizer.py prompt.txt --analyze

# Generate optimized version of a prompt
python scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Evaluate RAG retrieval quality (context relevance, faithfulness, Precision@K)
python scripts/rag_evaluator.py \
  --contexts retrieved.json \
  --questions eval_set.json \
  --output report.json --verbose

# Validate and visualize an agent workflow definition
python scripts/agent_orchestrator.py agent.yaml --validate
python scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
python scripts/agent_orchestrator.py agent.yaml --estimate-cost --runs 100

# Generate pipeline orchestration config (Airflow, Prefect, or Dagster)
python scripts/pipeline_orchestrator.py generate \
  --type airflow --source postgres --destination snowflake --schedule "0 5 * * *"

# Validate data quality (schema, freshness, completeness, uniqueness)
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Analyze and optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql --engine spark --recommend

# Vision model training configuration
python scripts/vision_model_trainer.py models/ --task detection --arch yolov8m

# Inference optimization analysis
python scripts/inference_optimizer.py model.pt --benchmark --recommend --target gpu

| Template | Use for |
|---|---|
| system-prompt-template.md | Starting a new agent system prompt |
| eval-case-template.md | Defining a new eval contract and golden dataset |

| Example | Shows |
|---|---|
| agent-system-prompt-example.md | Full production HR agent system prompt (all 8 blocks) |
| eval-contract-example.py | Complete EvalItemContract + EvalSetContract + golden dataset entries |

| Domain | Stack |
|---|---|
| LLM providers | OpenAI, Anthropic, Groq, Mistral, Google GenAI |
| Agent SDK | OpenAI Agents SDK (Runner, SQLiteSession, AgentHooks) |
| Orchestration | Celery, Django Channels, A2A, ACP, MCP SDK |
| Eval | OpenAI Evals API + file-backed EvalItemContract / EvalSetContract |
| Retrieval | Embedding search, MMR diversification, TurboPuffer |
| Data orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Quality | Great Expectations, dbt tests |
| ML/DS | Scikit-learn, XGBoost, MLflow, SHAP |
| CV | PyTorch, YOLOv8, Detectron2, ONNX, TensorRT |
npx claudepluginhub alvarovillalbaa/plugins --plugin engineering