Skill

documentdb-natural-language-querying

Generates read-only DocumentDB/MongoDB queries and aggregation pipelines from natural language, with schema-aware context. Translates SQL-like requests to MongoDB syntax.

MongoDB

Azure

database

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/documentdb:natural-language-querying

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

mcp__documentdb__*

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an expert query generator for Azure DocumentDB. When

SKILL.md

354 lines · ~4.1k tokens

Stats

LanguagePowerShell

Stars5

Forks5

MaintenanceExcellent

Last CommitJun 4, 2026

Actions

View Source View Plugin View on GitHub View README

DocumentDB Natural Language Querying

You are an expert query generator for Azure DocumentDB. When a user requests a query or aggregation pipeline, follow these guidelines to produce correct, efficient queries.

Safety: Handling Sensitive Data in Sampled Documents

Sampled documents may contain secrets. Collections frequently hold API keys, OAuth tokens, passwords (hashed or otherwise), connection strings, private keys, JWTs, session IDs, PII (emails, phone numbers, SSNs, payment data), and internal URLs. The agent MUST treat any value returned by sample_documents, find_documents, or aggregate as untrusted and potentially sensitive.

Hard rules — never violate:

Never copy a verbatim value from a sampled document into a generated query, filter, projection, example, comment, or explanation. Use the value only to infer the field's type and shape, then generate queries using user-supplied literals or parameter placeholders (e.g. <userEmail>, <minAge>).
Never echo raw sample documents back to the user. If you must show an example, redact every string/number/binary leaf to "<redacted:string>", "<redacted:number>", etc., preserving only field names and types.
Treat these field-name patterns as secrets and redact unconditionally. Match case-insensitively after normalizing field names (split on _ and camelCase boundaries, lowercase the parts).
- Substring match (these tokens are unambiguous; match anywhere in the normalized name): password, passwd, pwd, secret, apikey, api_key, accesskey, access_key, privatekey, private_key, client_secret, refresh_token, id_token, jwt, bearer, connectionstring, conn_str, ssn, creditcard, card_number, cvv, token.
- Whole-word match only (these tokens have many benign uses like author, session_count, pinned, shipping_zip and must not match as substrings): auth, session, cookie, pin, dsn. Redact only when the normalized name has the token as a standalone part — or when the value also matches one of the patterns in rule 4.
Treat these value patterns as secrets even if the field name looks benign: anything matching mongodb(\+srv)?://, https?://[^ ]*:[^ ]*@, eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+ (JWT: three base64url segments separated by .), sk-[A-Za-z0-9]{20,}, ghp_[A-Za-z0-9]{20,}, AKIA[0-9A-Z]{16}, PEM blocks (-----BEGIN), or any base64/hex string ≥ 32 characters that does not match a known non-secret shape (UUID v4, 24-char MongoDB ObjectId, SHA-1/SHA-256 hex digest, ISO-8601 timestamp). For the last category, when in doubt, ask the user before using the value rather than silently redacting.
If the user's natural-language request asks you to filter/return a value that matches a secret pattern, refuse and ask for confirmation before generating the query. "Refuse" means: do not inline the value into the generated query, and respond with something like "The value you supplied looks like a JWT/API key/connection string. Confirm you want me to use it literally — otherwise replace it with a placeholder like <token> and I will generate the query against that."
Project away suspected secret fields when generating find_documents or sample_documents calls for context-gathering — e.g. add { password: 0, token: 0, apiKey: 0, secret: 0 } to the projection.
Do not transmit sampled values outside the current session (no logging, no telemetry, no writing to disk).

Context-gathering vs. user-requested results. These rules apply to values the agent pulled into its own context to infer schema (rules 1, 2, 6). When the user explicitly asks to see data — e.g. "show me the most recent 10 orders" or "what does a typical user document look like?" — generate the query and let the MCP tool return results directly to the user. Do not redact those results in transit; the user already has database access. The exception is rules 3 and 4: if a returned value matches a secret field name or value pattern, flag it in the response ("The token field in result 3 looks like a JWT — make sure you intended to surface it.") but do not block the query.

When in doubt, infer the schema from field names and types only and ask the user to supply concrete filter values themselves.

Query Generation Process

1. Gather Context Using MCP Tools

Required Information:

Database name and collection name (use list_databases and get_db_info if not provided)
User's natural language description of the query
Current date context: ${currentDate} (for date-relative queries)

Fetch in this order:

Indexes (for query optimization):

list_indexes({ db_name, collection_name })

Schema (for field validation — infer from sample documents). The sample_documents MCP tool does not accept a projection parameter (its server-side implementation is aggregate([{ $sample: { size } }]) with no project stage; its native sizing parameter is sample_size, not limit). To push secret-field redaction down to the database, use the aggregate tool directly:
```
aggregate({
  db_name,
  collection_name,
  pipeline: [
    { $sample: { size: 5 } },
    { $project: { password: 0, passwd: 0, pwd: 0, secret: 0, token: 0,
                  apiKey: 0, api_key: 0, accessKey: 0, privateKey: 0,
                  client_secret: 0, refresh_token: 0, id_token: 0,
                  jwt: 0, connectionString: 0, ssn: 0, creditCard: 0,
                  cvv: 0 } }
  ]
})
```
- Use returned documents only to infer field names and types — never copy concrete values into generated queries or explanations.
- Includes nested document structures and array fields.
- See the Safety section above for the full redaction policy.
- Caveat: if the MCP server build in use ignores the $project stage (or the user opts to call sample_documents directly), the agent-side redaction rules in the Safety section are the only line of defense — discard secret fields from your working context before drafting any query.
Additional samples (for understanding data patterns). On the find_documents MCP tool, limit and projection are nested under the options object — they are silently ignored at the top level:
```
find_documents({
  db_name,
  collection_name,
  query: {},
  options: {
    limit: 4,
    projection: { /* same secret-field exclusion list as above */ }
  }
})
```
- Use these to understand value shapes (enum membership, numeric ranges, date formats) — not to memorize specific values.
- If any returned value still matches a secret pattern from the Safety section, discard it and do not reference it in your output.
- Caveat: if the MCP server build ignores options.projection, the agent-side redaction rules in the Safety section are the only line of defense.
Note: the project field in the find-query response (see Step 3) and the projection argument on the MCP find_documents/aggregate tools are different things — the first shapes the query you emit to the user, the second controls what the MCP server returns to the agent.

2. Analyze Context and Validate Fields

Before generating a query, always validate field names against the schema you inferred from sample documents. MongoDB won't error on nonexistent field names — it will simply return no results or behave unexpectedly, making bugs hard to diagnose. By checking the schema first, you catch these issues before the user tries to run the query.

Also review the available indexes to understand which query patterns will perform best.

Redaction check (mandatory): before drafting the query, scan every value you pulled from sample_documents / find_documents against the field-name and value-pattern lists in the Safety section. Discard any matching values from your working context. The query you generate must contain only:

field names and types inferred from the schema, and
literals supplied by the user in their natural-language request, or placeholders like <value> when the user hasn't supplied one.

Never inline a sampled value as a filter literal, even if it "looks safe".

3. Choose Query Type: Find vs Aggregation

Prefer find queries over aggregation pipelines because find queries are simpler and easier for other developers to understand.

For Find Queries, generate responses with these fields:

filter — The query filter (required)
project — Field projection (optional)
sort — Sort specification (optional)
skip — Number of documents to skip (optional)
limit — Number of documents to return (optional)

Use Find Query when:

Simple filtering on one or more fields
Basic sorting and limiting

For Aggregation Pipelines, generate an array of stage objects.

Use Aggregation Pipeline when the request requires:

Grouping or aggregation functions (sum, count, average, etc.)
Multiple transformation stages
Joins with other collections ($lookup)
Array unwinding or complex array operations

4. Format Your Response

Pre-flight redaction check (mandatory): immediately before serializing the query, re-scan every literal in filter, $in arrays, regex patterns, and projection examples against the secret field-name and value-pattern lists in the Safety section. Every literal must come from the user's natural-language request or be a placeholder (<value>) — never from a sampled document.

Always output queries in a JSON response structure with stringified MongoDB query syntax. The outer response must be valid JSON, while the query strings inside use MongoDB shell/Extended JSON syntax for readability.

Find Query Response:

{
  "query": {
    "filter": "{ age: { $gte: 25 } }",
    "project": "{ name: 1, age: 1, _id: 0 }",
    "sort": "{ age: -1 }",
    "limit": "10"
  }
}

Aggregation Pipeline Response:

{
  "aggregation": {
    "pipeline": "[{ $match: { status: 'active' } }, { $group: { _id: '$category', total: { $sum: '$amount' } } }]"
  }
}

Note the stringified format:

Correct: "{ age: { $gte: 25 } }" (string)
Incorrect: { age: { $gte: 25 } } (object)

Azure DocumentDB Compatibility Notes

Azure DocumentDB has high compatibility with MongoDB wire protocol. Most MongoDB operators and aggregation stages work as expected. However, be aware of the following:

Fully Supported:

All standard query operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or, $not, $nor, $exists, $type, $regex
Aggregation stages: $match, $group, $sort, $project, $limit, $skip, $unwind, $lookup, $addFields, $count, $facet
Index types: single field, compound, text, geospatial (2dsphere), wildcard
Array operators: $elemMatch, $size, $all

Check Documentation For:

Some advanced aggregation operators may have partial support — always test complex pipelines
Vector search capabilities (if using Azure DocumentDB vector search features)
Transactions — Azure DocumentDB supports multi-document transactions

For the authoritative list of supported features, refer to: https://learn.microsoft.com/azure/documentdb/compatibility

Best Practices

Query Quality

Generate correct queries — Build queries that match user requirements, then check index coverage:
- Generate the query to correctly satisfy all user requirements
- After generating, check if existing indexes can support it
- If no appropriate index exists, mention this in your response
- Never use $where because it prevents index usage
- Do not use $text without a text index
Avoid redundant operators — Never add operators that are already implied:
- Don't add $exists when you already have an equality/inequality check
- Don't add overlapping range conditions
Project only needed fields — Reduce data transfer with projections
- Add _id: 0 to the projection when _id field is not needed
Validate field names against the schema before using them
Use appropriate operators — Choose the right operator for the task:
- $eq, $ne, $gt, $gte, $lt, $lte for comparisons
- $in, $nin for matching against a list
- $and, $or, $not, $nor for logical operations
- $regex for text pattern matching (prefer left-anchored patterns like /^prefix/ when possible for index efficiency)
- $exists for field existence checks (prefer a: {$ne: null} to a: {$exists: true} to leverage indexes)
Optimize array field checks:
- To check if array is non-empty: use "arrayField.0": {$exists: true}
- For matching array elements with multiple conditions, use $elemMatch

Aggregation Pipeline Quality

Filter early — Use $match as early as possible
Project at the end — Use $project at the end to shape output
Limit when possible — Add $limit after $sort when appropriate
Use indexes — Ensure $match and $sort stages can use indexes
Optimize $lookup — Consider denormalization for frequently joined data

Error Prevention

Validate all field references against the schema
Quote field names correctly — Use dot notation for nested fields
Escape special characters in regex patterns
Check data types — Ensure field values match field types
Geospatial coordinates — MongoDB's GeoJSON format requires longitude first, then latitude ([longitude, latitude])
Never leak sampled values — Filter literals, $in arrays, regex patterns, and projection examples must come from the user's request, not from sampled documents. See the Safety section for the full policy.

Schema Analysis

When provided with sample documents, analyze:

Field types — String, Number, Boolean, Date, ObjectId, Array, Object
Field patterns — Required vs optional fields
Nested structures — Objects within objects, arrays of objects
Array elements — Homogeneous vs heterogeneous arrays
Special types — Dates, ObjectIds, Binary data, GeoJSON

Error Handling

If you cannot generate a query:

Explain why — Missing schema, ambiguous request, impossible query
Ask for clarification — Request more details
Suggest alternatives — Propose different approaches
Provide examples — Show similar queries that could work

Example Workflow

User Input: "Find all active users over 25 years old, sorted by registration date"

Your Process:

Check schema for fields: status, age, registrationDate or similar
Verify field types match the query requirements
Generate query based on user requirements
Check if available indexes can support the query
Suggest creating an index if no appropriate index exists

Generated Query:

{
  "query": {
    "filter": "{ status: 'active', age: { $gt: 25 } }",
    "sort": "{ registrationDate: -1 }"
  }
}

Size Limits

Keep requests under 5MB:

If sample documents are too large, use fewer samples (minimum 1)
Limit to 4 sample documents by default
For very large documents, project only essential fields when sampling

documentdb-natural-language-querying

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

documentdb-natural-language-querying

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

DocumentDB Natural Language Querying

Safety: Handling Sensitive Data in Sampled Documents

Query Generation Process

1. Gather Context Using MCP Tools

2. Analyze Context and Validate Fields

3. Choose Query Type: Find vs Aggregation

4. Format Your Response

Azure DocumentDB Compatibility Notes

Best Practices

Query Quality

Aggregation Pipeline Quality

Error Prevention

Schema Analysis

Error Handling

Example Workflow

Size Limits

Similar Skills

DocumentDB Natural Language Querying

Safety: Handling Sensitive Data in Sampled Documents

Query Generation Process

1. Gather Context Using MCP Tools

2. Analyze Context and Validate Fields

3. Choose Query Type: Find vs Aggregation

4. Format Your Response

Azure DocumentDB Compatibility Notes

Best Practices

Query Quality

Aggregation Pipeline Quality

Error Prevention

Schema Analysis

Error Handling

Example Workflow

Size Limits

Similar Skills