Skill

nvidia-triton-inference-serving-review

Statically reviews Triton Inference Server deployments for model repository layout, config.pbtxt, dynamic batching, ensemble/BLS pipelines, custom backend trust, endpoint auth, response cache, and metrics exposure.

Python

C++

deployment

security


Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/vanguard-frontier-agentic:nvidia-triton-inference-serving-review

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Grep, Glob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Static review of Triton Inference Server deployments against NVIDIA's Triton documentation — model repository layout, dynamic batching, ensemble pipelines, custom backend trust, gRPC/HTTP authentication, model encryption at rest, response cache poisoning surface. This skill is doc-anchored: it grounds review findings in NVIDIA's published documentation rather than in a certification blueprint, ...

Supporting Files

metadata.json

SKILL.md

37 lines · ~774 tokens

Stats

Language: Python

Stars: 18

Forks: 2

Maintenance: Excellent

Last Commit: Jun 15, 2026


NVIDIA Triton Inference Server Review

Purpose

Lean operating rules

Prefer the user's actual model_repository/ tree and config.pbtxt files as evidence; otherwise fall back to documentation-based inference.
Treat custom Python or C++ backends loaded from non-pinned sources or without code review as a critical finding — in-process RCE.
Treat gRPC or HTTP endpoints exposed without authentication, mTLS, or a restricted-protocol gateway as a critical finding for multi-tenant deployments.
Treat model repository directories with world-writable permissions or a writable --model-repository mount as a high finding — silent model substitution.
Treat response caching enabled across tenants without per-request cache-key partitioning as a high finding — cross-tenant cache poisoning.
Treat ensemble or BLS pipelines that pass user-supplied tensors directly to a Python backend without input validation as a medium finding — deserialization surface.
Treat metrics endpoints (:8002) exposed to the public network without scraping ACLs as a medium finding — model name and shape leakage.
Treat dynamic batching with max_queue_delay_microseconds left at its default while latency SLOs are in the millisecond range as a low finding: the throughput-versus-latency trade-off has not been tuned for the workload.
Always emit the exact tritonserver and perf_analyzer commands the user should run — do not execute them.
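
A few of the rules above lend themselves to a mechanical first pass. The following is a minimal sketch, not the skill's actual implementation: it walks a model repository read-only, flags world-writable directories, and scans each config.pbtxt with regular expressions. The function names, patterns, and messages are hypothetical, and a regex scan is only a heuristic (it is not a protobuf-text parser), so every hit is a candidate finding that still needs human confirmation.

```python
import os
import re
import stat
from pathlib import Path

# Hypothetical heuristics for illustration; they match text, not parsed config.
DYNAMIC_BATCHING = re.compile(r"\bdynamic_batching\b")
QUEUE_DELAY = re.compile(r"\bmax_queue_delay_microseconds\s*:")
RESPONSE_CACHE_ON = re.compile(r"response_cache\s*\{[^}]*enable\s*:\s*true", re.S)
PYTHON_BACKEND = re.compile(r'backend\s*:\s*"python"')


def scan_config_text(text):
    """Return (severity, message) candidate findings for one config.pbtxt."""
    findings = []
    if PYTHON_BACKEND.search(text):
        # Severity is the ceiling from the rules; provenance must be checked by hand.
        findings.append(("critical", "python backend: confirm source pinning and code review"))
    if RESPONSE_CACHE_ON.search(text):
        findings.append(("high", "response cache enabled: confirm tenant partitioning"))
    if DYNAMIC_BATCHING.search(text) and not QUEUE_DELAY.search(text):
        findings.append(("low", "dynamic_batching without max_queue_delay_microseconds"))
    return findings


def scan_repository(root):
    """Walk a model repository without modifying it; return (path, severity, message)."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # S_IWOTH set means any local user can replace model files in this directory.
        if os.stat(dirpath).st_mode & stat.S_IWOTH:
            results.append((dirpath, "high", "world-writable directory"))
        if "config.pbtxt" in filenames:
            path = Path(dirpath) / "config.pbtxt"
            for severity, message in scan_config_text(path.read_text()):
                results.append((str(path), severity, message))
    return results
```

Calling `scan_repository("model_repository")` on a local checkout returns a flat list that can be grouped by severity. The scan only reads metadata and file contents, which keeps it within the spirit of the skill's read-only tool access.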

Response minimum

Return, at minimum:

the scoped target (model repository layout and provenance, backend trust posture, endpoint and auth posture, batching and ensemble posture, response cache and metrics posture, recommended tritonserver/perf_analyzer invocations) and evidence level,
findings labelled critical / high / medium / low,
recommended NVIDIA-tooling invocations the user should run themselves,
safe next actions and assumptions or blockers.
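
One way to keep every response aligned with this minimum is to assemble it from a small data structure. The sketch below is an assumption about how that could look, not something the skill prescribes: `Finding`, `Review`, `suggested_commands`, and `render` are hypothetical names, and the default gRPC address and the concurrency range are placeholder values. The command strings are built for the user to run and are never executed by the code.

```python
import shlex
from dataclasses import dataclass, field

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    severity: str  # one of the keys in SEVERITY_ORDER
    summary: str
    evidence: str  # e.g. a file path, or "documentation-based inference"


@dataclass
class Review:
    scoped_target: str
    evidence_level: str
    findings: list = field(default_factory=list)
    next_actions: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)


def suggested_commands(repo_path, model_name, grpc_url="localhost:8001"):
    """Build shell-quoted command strings; nothing is executed here."""
    serve = ["tritonserver", f"--model-repository={repo_path}"]
    bench = ["perf_analyzer", "-m", model_name, "-u", grpc_url,
             "-i", "grpc", "--concurrency-range", "1:4"]
    return [shlex.join(serve), shlex.join(bench)]


def render(review, commands):
    """Render the report with findings ordered from critical to low."""
    lines = [f"Scoped target: {review.scoped_target}",
             f"Evidence level: {review.evidence_level}",
             "Findings:"]
    ordered = sorted(review.findings, key=lambda f: SEVERITY_ORDER[f.severity])
    lines += [f"  [{f.severity}] {f.summary} ({f.evidence})" for f in ordered]
    lines.append("Commands to run yourself:")
    lines += [f"  {c}" for c in commands]
    lines.append("Safe next actions:")
    lines += [f"  {a}" for a in review.next_actions]
    lines.append("Assumptions or blockers:")
    lines += [f"  {a}" for a in review.assumptions]
    return "\n".join(lines)
```

Using `shlex.join` keeps paths with spaces or shell metacharacters safely quoted in the emitted commands, which matters because the user will paste them into a terminal.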
