From NVIDIA
Sets up, scales, validates, and recovers NVIDIA Physical AI infrastructure for synthetic data generation on MicroK8s or Azure AKS, including Kubernetes, inference endpoints, and OSMO deployment.
Slash command: /nvidia-skills:physical-ai-infrastructure-setup-and-resilient-scaling

- BENCHMARK.md
- components/azure-access/reference.md
- components/cluster-azure/reference.md
- components/cluster-azure/scripts/helmfile.yaml
- components/cluster-azure/scripts/main.tf
- components/cluster-azure/scripts/outputs.tf
- components/cluster-azure/scripts/preflight.sh
- components/cluster-azure/scripts/setup.sh
- components/cluster-azure/scripts/storage-class-nfs.yaml
- components/cluster-azure/scripts/system_node_capacity_test.sh
- components/cluster-azure/scripts/terraform.tfvars.example
- components/cluster-azure/scripts/variables.tf
- components/cluster-azure/scripts/versions.tf
- components/cluster-azure/terraform/main.tf
- components/cluster-azure/terraform/modules/README.md
- components/cluster-azure/terraform/outputs.tf
- components/cluster-azure/terraform/prerequisites/az-sub-init.sh
- components/cluster-azure/terraform/prerequisites/install-terraform.sh
- components/cluster-azure/terraform/prerequisites/register-azure-providers.sh
- components/cluster-azure/terraform/prerequisites/robotics-azure-resource-providers.txt

Canonical skill for the Physical AI infrastructure stack. Use it to compose cluster, inference, OSMO, and workload stages into a reproducible Physical AI SDG environment, then keep the environment observable and recoverable.
- `${REPO_ROOT}/.env`. Cluster-derived values such as storage, database, Redis, and endpoint names come from Terraform outputs or platform queries, not `.env`.
- `secretKeyRef`, and runtime-only secret injection. Scan raw transcript exports with `scripts/scan_transcript_secrets.py` before sharing.
- `git rev-parse --show-toplevel`.

Each component lives inside this skill so the stack has one canonical trigger. Load the component reference only when the selected target needs that slice.

| Concern | Load | Assets |
|---|---|---|
| Stage matrix and old driver notes | components/driver/reference.md | None |
| MicroK8s cluster | components/cluster-microk8s/reference.md | components/cluster-microk8s/scripts/, components/cluster-microk8s/runtimeclass-nvidia-runc.yaml |
| Azure AKS cluster | components/cluster-azure/reference.md | components/cluster-azure/scripts/, components/cluster-azure/terraform/ |
| NIM Operator inference | components/inference-nim-operator/reference.md | components/inference-nim-operator/scripts/, components/inference-nim-operator/nims/ |
| NVCF inference | components/inference-nvcf/reference.md | components/inference-nvcf/scripts/ |
| Azure AI Foundry inference | components/inference-azure/reference.md | components/inference-azure/scripts/ |
| MicroK8s OSMO | components/osmo-k8s/reference.md | components/osmo-k8s/scripts/, upstream OSMO deploy scripts |
| Azure OSMO | components/osmo-azure/reference.md | components/osmo-azure/scripts/, upstream OSMO deploy scripts plus Azure TF outputs |
| Azure access setup | components/azure-access/reference.md | None |
| OSMO CLI and workflow operations | components/osmo-cli/reference.md | components/osmo-cli/scripts/, components/osmo-cli/references/, components/osmo-cli/agents/, components/osmo-cli/tests/ |
| OpenClaw Azure device login | components/openclaw-azure-login/reference.md | None |
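
The "load only what the target needs" rule can be pictured as a small lookup. The sketch below is illustrative only: the function name, the lowercase selection keys, and the ordering after the first Azure entry are assumptions, while the reference paths are taken from the table above. It places the Azure access reference first for Azure targets, on the assumption that it should be read before any other Azure component.

```python
# Hypothetical helper: names and key spellings are illustrative.
# Only the reference paths come from the component table.
CLUSTER_REFS = {
    "microk8s": [
        "components/cluster-microk8s/reference.md",
        "components/osmo-k8s/reference.md",
    ],
    "azure": [
        # Assumed to be read before the other Azure components.
        "components/azure-access/reference.md",
        "components/cluster-azure/reference.md",
        "components/osmo-azure/reference.md",
    ],
}

INFERENCE_REFS = {
    "nim-operator": "components/inference-nim-operator/reference.md",
    "nvcf": "components/inference-nvcf/reference.md",
    "azure-ai-foundry": "components/inference-azure/reference.md",
    "none": None,  # no inference component selected
}


def references_to_load(cluster: str, inference: str) -> list[str]:
    """Return the component references needed for one selection."""
    refs = list(CLUSTER_REFS[cluster])
    inference_ref = INFERENCE_REFS[inference]
    if inference_ref is not None:
        refs.append(inference_ref)
    return refs


if __name__ == "__main__":
    for path in references_to_load("azure", "nvcf"):
        print(path)
```

References that are not in the returned list, such as the OSMO CLI or OpenClaw login files, would still be loaded on demand when the task calls for them.
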
The OSMO CLI component has second-level support files because its command and workflow surface is large. Load these directly only for the stated case.
| File | Read when |
|---|---|
| components/osmo-cli/agents/workflow-expert.md | Spawning a workflow-generation or workflow-failure subagent. |
| components/osmo-cli/agents/logs-reader.md | Spawning a log summarization subagent for OSMO workflow failures. |
| components/osmo-cli/references/cli-commands.md | Exact OSMO CLI flags, payloads, or command syntax are needed. |
| components/osmo-cli/references/workflow-spec.md | Workflow YAML schema, credentials, outputs, or provider fields are needed. |
| components/osmo-cli/references/workflow-patterns.md | Multi-task, data dependency, Jinja, serial, or parallel workflow design is needed. |
| components/osmo-cli/references/advanced-patterns.md | Checkpointing, retry/exit behavior, or node exclusion is needed. |
| components/osmo-cli/tests/orchestrator-runtime-failure.md | Validating or debugging the OSMO orchestration review pattern. |

Pick exactly one option per stage. Stage 2 follows stage 1.

1. MicroK8s or Azure
2. MicroK8s OSMO when Kubernetes is MicroK8s, Azure OSMO when Kubernetes is Azure
3. NIM Operator, NVCF, Azure AI Foundry, or None

Reject invalid combinations before provisioning:

| Cluster | NIM Operator | NVCF | Azure AI Foundry |
|---|---|---|---|
| MicroK8s | yes | yes | no, Foundry requires Azure identities |
| Azure | yes | yes | yes |
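
As a rough illustration of rejecting invalid combinations before provisioning, the matrix above can be encoded as data and checked in one function. This is a minimal sketch, not the skill's own validator: the function name, the lowercase option keys, and the treatment of "none" as valid on both clusters are assumptions.

```python
# Sketch of the compatibility matrix. Key spellings are hypothetical;
# "none" is assumed to be valid on either cluster.
ALLOWED_INFERENCE = {
    "microk8s": {"nim-operator", "nvcf", "none"},
    "azure": {"nim-operator", "nvcf", "azure-ai-foundry", "none"},
}

# The OSMO option is determined by the Kubernetes choice.
OSMO_FOR_CLUSTER = {"microk8s": "osmo-k8s", "azure": "osmo-azure"}


def validate_selection(cluster: str, osmo: str, inference: str) -> list[str]:
    """Return a list of problems; an empty list means the selection is valid."""
    if cluster not in ALLOWED_INFERENCE:
        return [f"unknown cluster: {cluster!r}"]

    errors = []
    expected_osmo = OSMO_FOR_CLUSTER[cluster]
    if osmo != expected_osmo:
        errors.append(f"{cluster} requires {expected_osmo}, got {osmo!r}")

    if inference not in ALLOWED_INFERENCE[cluster]:
        if inference == "azure-ai-foundry":
            errors.append("Azure AI Foundry requires Azure identities")
        else:
            errors.append(f"unsupported inference option: {inference!r}")
    return errors


if __name__ == "__main__":
    problems = validate_selection("microk8s", "osmo-k8s", "azure-ai-foundry")
    for problem in problems:
        print("reject:", problem)
```

Returning every problem at once, instead of raising on the first, lets the caller show the full list before any provisioning starts.
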
For OpenClaw or any chat-only environment that cannot open a browser, read
components/openclaw-azure-login/reference.md before Azure prerequisites.
For any Azure target, read components/azure-access/reference.md before Azure
component preflights.
- `scripts/preflight.sh` for every selected infrastructure component plus any OSMO CLI/workload preflight before provisioning; build the implementation plan from the results and stop on red preflight.
- `preflight_credentials.sh`, `pre_submit_guard.py` with resolved `--set` values, non-empty model-cache prefixes, and workflow-namespace endpoint smoke checks.
- `components/osmo-cli/reference.md`; do not resubmit blindly.

Avoid over-deploying expensive endpoints.

- `*.osmo-nims.svc.cluster.local`, `api.nvcf.nvidia.com/*`, `*.inference.ai.azure.com`, or `*.cognitiveservices.azure.com`.
- `components/inference-nim-operator/nims/`.
- `components/inference-azure/scripts/install.sh`.

Each stage has its own Verify section in the component reference. These gates are mandatory:

| Stage | Gate |
|---|---|
| Kubernetes | Cluster API reachable, nodes Ready, GPU capacity advertised for GPU paths, and CPU+NVCF paths have runtimeclass/nvidia mapped to runc. |
| Inference | Every endpoint referenced by the workload is reachable. NIM readiness uses /v1/health/ready; NVCF and Foundry still need task-specific authenticated checks. |
| OSMO | OSMO pods Ready, pool ONLINE, port-forward watchdogs alive, storage credentials configured, and verify-hello workflow COMPLETED. |
| Workload | Selected workload pre-submit guards pass before submit. osmo workflow query <id> reports COMPLETED and every task is green. Failed terminal states require events and logs before retry. |
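
For the Inference gate, a NIM readiness probe might look like the sketch below. Only the `/v1/health/ready` path comes from the table; the function names, the injectable `fetch_status` parameter, and the assumption that HTTP 200 means ready are illustrative. NVCF and Foundry endpoints are not covered here because they need task-specific authenticated checks.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def nim_ready(base_url: str, fetch_status=http_status) -> bool:
    """Check one NIM endpoint; HTTP 200 is assumed to mean ready."""
    url = base_url.rstrip("/") + "/v1/health/ready"
    return fetch_status(url) == 200


def unready_endpoints(base_urls, fetch_status=http_status) -> list[str]:
    """Return every endpoint that fails the readiness probe."""
    return [url for url in base_urls if not nim_ready(url, fetch_status)]
```

The gate passes only when `unready_endpoints` returns an empty list for every NIM endpoint the workload references. Passing `fetch_status` as a parameter keeps the probe testable without a live cluster.
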
- `terraform apply`.
- `skills/physical-ai-video-data-augmentation/SKILL.md`.
- `skills/physical-ai-defect-image-generation/SKILL.md`.
- `skills/carline-adaptation/SKILL.md`.
- `skills/INDEX.md`.

Latest static review: 2026-05-26, description keywords match the expected routes above.

npx claudepluginhub nvidia/skills --plugin nvidia-skills