Deploys MLflow models, custom pyfunc models, and GenAI agents to Databricks Model Serving endpoints. Queries endpoints, checks their status, and integrates UC Functions and Vector Search tools.
Slash command: /databricks-ai-dev-kit:databricks-model-serving
Deploy MLflow models and AI agents to scalable REST API endpoints.
| Model Type | Pattern | Reference |
|---|---|---|
| Traditional ML (sklearn, xgboost) | mlflow.sklearn.autolog() | 1-classical-ml.md |
| Custom Python model | mlflow.pyfunc.PythonModel | 2-custom-pyfunc.md |
| GenAI Agent (LangGraph, tool-calling) | ResponsesAgent | 3-genai-agents.md |
ALWAYS use exact endpoint names from this table. NEVER guess or abbreviate.
| Endpoint Name | Provider | Notes |
|---|---|---|
| databricks-gpt-5-2 | OpenAI | Latest GPT, 400K context |
| databricks-gpt-5-1 | OpenAI | Instant + Thinking modes |
| databricks-gpt-5-1-codex-max | OpenAI | Code-specialized (high perf) |
| databricks-gpt-5-1-codex-mini | OpenAI | Code-specialized (cost-opt) |
| databricks-gpt-5 | OpenAI | 400K context, reasoning |
| databricks-gpt-5-mini | OpenAI | Cost-optimized reasoning |
| databricks-gpt-5-nano | OpenAI | High-throughput, lightweight |
| databricks-gpt-oss-120b | OpenAI | Open-weight, 128K context |
| databricks-gpt-oss-20b | OpenAI | Lightweight open-weight |
| databricks-claude-opus-4-6 | Anthropic | Most capable, 1M context |
| databricks-claude-sonnet-4-6 | Anthropic | Hybrid reasoning |
| databricks-claude-sonnet-4-5 | Anthropic | Hybrid reasoning |
| databricks-claude-opus-4-5 | Anthropic | Deep analysis, 200K context |
| databricks-claude-sonnet-4 | Anthropic | Hybrid reasoning |
| databricks-claude-opus-4-1 | Anthropic | 200K context, 32K output |
| databricks-claude-haiku-4-5 | Anthropic | Fastest, cost-effective |
| databricks-claude-3-7-sonnet | Anthropic | Retiring April 2026 |
| databricks-meta-llama-3-3-70b-instruct | Meta | 128K context, multilingual |
| databricks-meta-llama-3-1-405b-instruct | Meta | Retiring May 2026 (PT) |
| databricks-meta-llama-3-1-8b-instruct | Meta | Lightweight, 128K context |
| databricks-llama-4-maverick | Meta | MoE architecture |
| databricks-gemini-3-1-pro | Google | 1M context, hybrid reasoning |
| databricks-gemini-3-pro | Google | 1M context, hybrid reasoning |
| databricks-gemini-3-flash | Google | Fast, cost-efficient |
| databricks-gemini-2-5-pro | Google | 1M context, Deep Think |
| databricks-gemini-2-5-flash | Google | 1M context, hybrid reasoning |
| databricks-gemma-3-12b | Google | 128K context, multilingual |
| databricks-qwen3-next-80b-a3b-instruct | Alibaba | Efficient MoE |
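
One way to enforce the exact-name rule in code is to check a requested name against an allow-list before using it. The sketch below is illustrative only: `KNOWN_ENDPOINTS` holds a small subset of the names from the table above, and `require_known_endpoint` is a hypothetical helper, not part of any Databricks SDK.

```python
# Subset of endpoint names copied from the table above; extend as needed.
KNOWN_ENDPOINTS = frozenset({
    "databricks-gpt-5-2",
    "databricks-gpt-5-mini",
    "databricks-claude-sonnet-4-5",
    "databricks-claude-haiku-4-5",
    "databricks-meta-llama-3-3-70b-instruct",
})


def require_known_endpoint(name: str) -> str:
    """Return name unchanged if it is an exact match, else raise ValueError."""
    if name not in KNOWN_ENDPOINTS:
        raise ValueError(
            f"Unknown endpoint name: {name!r}. Use an exact name from the table."
        )
    return name
```

An abbreviated name such as `gpt-5-mini` fails fast here instead of producing a confusing error at query time.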
| Endpoint Name | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| databricks-gte-large-en | 1024 | 8192 | English, not normalized |
| databricks-bge-large-en | 1024 | 512 | English, normalized |
| databricks-qwen3-embedding-0-6b | up to 1024 | ~32K | 100+ languages, instruction-aware |
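
The table notes that databricks-gte-large-en returns embeddings that are not normalized. If similarity is computed with a plain dot product, one option is to L2-normalize each vector first so that the dot product equals cosine similarity. A minimal sketch in pure Python, using short made-up vectors rather than real 1024-dimensional endpoint output; `l2_normalize` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy vectors standing in for endpoint output (same direction, different length).
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
```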
- databricks-meta-llama-3-3-70b-instruct (good balance of quality/cost)
- databricks-gte-large-en
- databricks-gpt-5-1-codex-mini or databricks-gpt-5-1-codex-max

These are pay-per-token endpoints available in every workspace. For production, consider provisioned throughput mode. See supported models.
| Topic | File | When to Read |
|---|---|---|
| Classical ML | 1-classical-ml.md | sklearn, xgboost, autolog |
| Custom PyFunc | 2-custom-pyfunc.md | Custom preprocessing, signatures |
| GenAI Agents | 3-genai-agents.md | ResponsesAgent, LangGraph |
| Tools Integration | 4-tools-integration.md | UC Functions, Vector Search |
| Development & Testing | 5-development-testing.md | MCP workflow, iteration |
| Logging & Registration | 6-logging-registration.md | mlflow.pyfunc.log_model |
| Deployment | 7-deployment.md | Job-based async deployment |
| Querying Endpoints | 8-querying-endpoints.md | SDK, REST, MCP tools |
| Package Requirements | 9-package-requirements.md | DBR versions, pip |
%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic
dbutils.library.restartPython()
Or via MCP:
execute_code(code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic")
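
The install command above pins some packages to exact versions and leaves others floating. Because this skill recommends exact versions in pip_requirements when logging a model, it can help to see which entries are actually pinned. The following is a rough sketch: `split_pinned` is a hypothetical helper, and it only recognizes the simple `name==version` form.

```python
def split_pinned(requirements):
    """Separate 'name==version' entries from entries without an exact pin."""
    pinned, unpinned = [], []
    for req in requirements:
        name, sep, version = req.partition("==")
        if sep and name.strip() and version.strip():
            pinned.append(req)
        else:
            unpinned.append(req)
    return pinned, unpinned


# Packages from the install command above.
reqs = ["mlflow==3.6.0", "databricks-langchain", "langgraph==0.3.4",
        "databricks-agents", "pydantic"]
pinned, unpinned = split_pinned(reqs)
print("pinned:", pinned)
print("unpinned:", unpinned)
```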
Create agent.py locally using the ResponsesAgent pattern (see 3-genai-agents.md).
manage_workspace_files(
action="upload",
local_path="./my_agent",
workspace_path="/Workspace/Users/<user_email>/my_agent"
)
execute_code(
file_path="./my_agent/test_agent.py",
cluster_id="<cluster_id>"
)
execute_code(
file_path="./my_agent/log_model.py",
cluster_id="<cluster_id>"
)
See 7-deployment.md for job-based deployment that doesn't time out.
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[{"role": "user", "content": "Hello!"}]
)
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
# Enable autolog with auto-registration
mlflow.sklearn.autolog(
log_input_examples=True,
registered_model_name="main.models.my_classifier"
)
# Train - model is logged and registered automatically
# (X_train and y_train are assumed to be prepared earlier)
model = LogisticRegression()
model.fit(X_train, y_train)
Then deploy via UI or SDK. See 1-classical-ml.md.
If MCP tools are not available, use the SDK/CLI examples in the reference files.
| Tool | Purpose |
|---|---|
| manage_workspace_files (action="upload") | Upload agent files to workspace |
| execute_code | Install packages, test agent, log model |
| Tool | Purpose |
|---|---|
| manage_jobs (action="create") | Create deployment job (one-time) |
| manage_job_runs (action="run_now") | Kick off deployment (async) |
| manage_job_runs (action="get") | Check deployment job status |
| Action | Description | Required Params |
|---|---|---|
| get | Check endpoint status (READY/NOT_READY/NOT_FOUND) | name |
| list | List all endpoints | (none, optional limit) |
| query | Send requests to endpoint | name + one of: messages, inputs, dataframe_records |
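
The query action needs name plus one of messages, inputs, or dataframe_records. A small validation sketch, reading "one of" as exactly one, and assuming a hypothetical `build_query_args` helper that assembles the keyword arguments before they are passed to manage_serving_endpoint:

```python
def build_query_args(name, messages=None, inputs=None, dataframe_records=None):
    """Return kwargs for a query call, requiring exactly one payload field."""
    if not name:
        raise ValueError("name is required")
    payloads = {
        "messages": messages,
        "inputs": inputs,
        "dataframe_records": dataframe_records,
    }
    # Keep only the payload fields the caller actually supplied.
    provided = {key: value for key, value in payloads.items() if value is not None}
    if len(provided) != 1:
        raise ValueError(
            "provide exactly one of messages, inputs, dataframe_records; "
            f"got {sorted(provided) or 'none'}"
        )
    return {"action": "query", "name": name, **provided}
```

The returned dict can then be unpacked into the tool call, for example `manage_serving_endpoint(**build_query_args(...))`.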
Example usage:
# Check endpoint status
manage_serving_endpoint(action="get", name="my-agent-endpoint")
# List all endpoints
manage_serving_endpoint(action="list")
# Query a chat/agent endpoint
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=500
)
# Query a traditional ML endpoint
manage_serving_endpoint(
action="query",
name="sklearn-classifier",
dataframe_records=[{"age": 25, "income": 50000, "credit_score": 720}]
)
manage_serving_endpoint(action="get", name="my-agent-endpoint")
Returns:
{
"name": "my-agent-endpoint",
"state": "READY",
"served_entities": [...]
}
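
Since deployment can take several minutes, a caller typically polls the status until state is READY. The loop below is a sketch, not the tool's own behavior: `get_state` is a caller-supplied function (for example, a wrapper around the get action that returns the state string), and `wait_until_ready`, the timeout, and the interval are illustrative assumptions.

```python
import time


def wait_until_ready(get_state, timeout_s=1200.0, interval_s=30.0, sleep=time.sleep):
    """Poll get_state() until it returns "READY" or timeout_s elapses.

    Returns True when READY is observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == "READY":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage with a stand-in status function that becomes READY on the third call.
states = iter(["NOT_READY", "NOT_READY", "READY"])
print(wait_until_ready(lambda: next(states), interval_s=0.0))
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.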
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[
{"role": "user", "content": "What is Databricks?"}
],
max_tokens=500
)
manage_serving_endpoint(
action="query",
name="sklearn-classifier",
dataframe_records=[
{"age": 25, "income": 50000, "credit_score": 720}
]
)
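
Outside MCP, the same queries can be sent over REST. The sketch below only builds the request and does not send it. It assumes the Model Serving invocations path (`/serving-endpoints/<name>/invocations`) and bearer-token authentication; the host and token values are placeholders, and `build_invocation_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_invocation_request(host, token, endpoint_name, payload):
    """Build (but do not send) a POST request to a serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; substitute real workspace values.
req = build_invocation_request(
    "https://example.cloud.databricks.com",
    "<token>",
    "sklearn-classifier",
    {"dataframe_records": [{"age": 25, "income": 50000, "credit_score": 720}]},
)
print(req.get_method(), req.full_url)
```

Sending the request is then a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON response.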
| Issue | Solution |
|---|---|
| Invalid output format | Use self.create_text_output_item(text, id) - NOT raw dicts! |
| Endpoint NOT_READY | Deployment takes ~15 min. Use manage_serving_endpoint(action="get") to poll. |
| Package not found | Specify exact versions in pip_requirements when logging model |
| Tool timeout | Use job-based deployment, not synchronous calls |
| Auth error on endpoint | Ensure resources are specified in log_model so automatic authentication passthrough can be set up |
| Model not found | Check Unity Catalog path: catalog.schema.model_name |
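
For the "Model not found" case, a quick check that a registered model name has the three-level catalog.schema.model_name form can catch typos before deployment. This is a simplified sketch: `parse_uc_model_name` is a hypothetical helper, and it checks only the shape of the name, not whether the model exists or whether each part uses legal characters.

```python
def parse_uc_model_name(full_name):
    """Split 'catalog.schema.model_name' into three parts or raise ValueError."""
    parts = full_name.split(".")
    if len(parts) != 3 or any(not part.strip() for part in parts):
        raise ValueError(f"expected catalog.schema.model_name, got {full_name!r}")
    catalog, schema, model = parts
    return catalog, schema, model


print(parse_uc_model_name("main.models.my_classifier"))
```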
WRONG - raw dicts don't work:
return ResponsesAgentResponse(output=[{"role": "assistant", "content": "..."}])
CORRECT - use helper methods:
return ResponsesAgentResponse(
output=[self.create_text_output_item(text="...", id="msg_1")]
)
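
To make the difference concrete, the sketch below contrasts a raw chat-style dict with the typed item structure that the helper is understood to produce: a `type` field plus type-specific fields, following the OpenAI Responses item format. The stand-in functions are assumptions written for illustration and are not MLflow code; verify the exact fields against your installed MLflow version.

```python
def text_output_item(text, id):
    """Stand-in for the shape create_text_output_item is assumed to return."""
    return {
        "type": "message",
        "id": id,
        "role": "assistant",
        "content": [{"type": "output_text", "text": text}],
    }


def is_typed_item(item):
    """True if item carries one of the type tags assumed in this sketch."""
    return item.get("type") in {"message", "function_call", "function_call_output"}


raw = {"role": "assistant", "content": "..."}
print(is_typed_item(raw))                                 # raw dict has no "type" key
print(is_typed_item(text_output_item("Hello", "msg_1")))  # typed item passes the check
```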
Available helper methods:
- self.create_text_output_item(text, id) - text responses
- self.create_function_call_item(id, call_id, name, arguments) - tool calls
- self.create_function_call_output_item(call_id, output) - tool results