This skill should be used when working with Talos Linux clusters, talosctl, or the Talos API. Covers machine configuration (v1alpha1), cluster bootstrap, Talos upgrades, Kubernetes version upgrades, boot asset building with imager, system extensions, networking (bonds, VLANs, VIPs, WireGuard, KubeSpan), etcd maintenance, troubleshooting, and disaster recovery. Triggers for queries like "upgrade my Talos cluster", "build a custom Talos ISO with extensions", "etcd is unhealthy", "node won't join the cluster", "configure bonding on Talos", "bootstrap a new Talos cluster", "reset a Talos node", "add a worker node", "restore etcd from snapshot", or "recover a failed control plane node".
Slash command: /talos:talos
This skill covers Talos Linux v1.13. All documentation references point to https://docs.siderolabs.com/talos/v1.13/.
Use Talos MCP tools — mcp__talos__* when the MCP is configured via user config, or mcp__plugin_talos_talos__* when installed via the plugin marketplace — for all Talos operations. Never shell out to talosctl unless the MCP tool is unavailable.
Use Kubernetes MCP tools (mcp__kubernetes-mcp-server__*) for all Kubernetes operations. Avoid kubectl unless the MCP tool is impractical (e.g., events — see below).
Use yq or jq for parsing YAML/JSON output. Avoid grep on structured data.
Avoid large results: MCP tool results that exceed the context window get dumped to temp files and become unusable. Always scope queries narrowly:
- For events, run kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -50 via Bash, or, if the result was saved to a temp file, extract recent events with jq '[.[] | .text | fromjson] | sort_by(.lastTimestamp) | .[-20:]' <file>
- Use tail_lines to limit output
- Use jq or yq via Bash to extract only what's needed, then retry with a narrower query
Operations requiring talosctl (no MCP equivalent, use via Bash):
- talosctl gen secrets / talosctl gen config: generate cluster configuration
- talosctl upgrade-k8s --to <version>: Kubernetes version upgrade (complex orchestration: patches configs, pre-pulls images, monitors rollout across all nodes)
The Talos client config lives at ~/.talos/config (or $TALOSCONFIG). It contains contexts with endpoints and TLS credentials. Each MCP tool accepts an optional node (singular string) to target one specific node and context to select a talosconfig context. There is no nodes array: each call targets exactly one node, so to fan out across a cluster, loop over your nodes and issue one tool call per node. Omitting node lets the request execute on whichever endpoint apid picks from the talosconfig (typically a control plane). That is fine for cluster-wide reads like talos_health or talos_etcd_members, but it means per-worker tools (talos_addresses, talos_disks, talos_read, talos_ls, etc.) will return the endpoint's data, not the worker's.
Before any Talos operation, check if a local talosconfig file exists in the current working directory or project root. If found, base64-encode it via Bash (base64 < talosconfig) and call talos_set_config(content) with the base64 output. This preserves the exact file formatting (long base64 cert lines must not be wrapped). This is critical when working in project directories that have their own cluster configs.
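GNU base64 wraps its output at 76 columns by default, while the BSD/macOS variant does not. Whether talos_set_config tolerates wrapped input is not stated here, so the sketch below takes the cautious route and always emits a single line. The helper name encode_talosconfig is hypothetical.

```shell
# encode_talosconfig FILE
# Prints FILE as one line of base64 with no embedded newlines.
# Hypothetical helper: stripping newlines makes GNU and BSD base64 output identical.
# The input file itself is never modified.
encode_talosconfig() {
  base64 < "$1" | tr -d '\n'
}
```

Running encode_talosconfig ./talosconfig prints the payload to pass as the content argument.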
Common TLS error: If you see x509: certificate signed by unknown authority or Ed25519 verification failure, this does NOT mean the certificates are incompatible. It means the talosconfig does not match the cluster — wrong config for the target cluster, stale config from a rebuilt cluster, or talos_set_config was not called. Fix: verify the correct talosconfig is loaded, re-run talos_set_config with the right file.
Single-cluster per session: The MCP server is stateful — talos_set_config sets the config for ALL subsequent calls in the session. It cannot operate on multiple clusters in parallel. To switch clusters, call talos_set_config again with the new config content. Use the context parameter on individual tools to switch between contexts within the same talosconfig.
Talos Linux is an immutable, API-driven, minimal Linux OS designed for Kubernetes. There is no SSH, no shell, no package manager. All management is via the Talos API (port 50000) using mutual TLS. The OS is read-only with an A/B partition scheme for atomic upgrades and rollback.
Key components: machined (init), apid (API gateway), trustd (certificate authority), etcd (on control plane nodes only).
Config: talos_set_config, talos_config_info, talos_machine_config
Cluster: talos_bootstrap, talos_health, talos_version, talos_members, talos_kubeconfig, talos_get
Node: talos_apply_config, talos_patch, talos_reboot, talos_shutdown, talos_reset, talos_upgrade, talos_rollback, talos_wipe
Services: talos_services, talos_service_restart, talos_containers, talos_stats, talos_image_list, talos_image_remove, talos_image_prune
Diagnostics: talos_logs, talos_dmesg, talos_processes
System: talos_disks, talos_mounts, talos_memory, talos_cpu, talos_disk_usage, talos_time
Network: talos_interfaces, talos_addresses, talos_routes, talos_netstat, talos_resolvers, talos_hostname
Storage: talos_volumes, talos_discovered_volumes
etcd: talos_etcd_members, talos_etcd_status, talos_etcd_snapshot, talos_etcd_defrag, talos_etcd_remove_member, talos_etcd_forfeit_leadership, talos_etcd_leave, talos_etcd_alarm
Filesystem: talos_ls, talos_read
Extensions: talos_extensions
Talos uses a single YAML configuration file (v1alpha1) with two top-level sections:
- machine (node-specific): type (controlplane/worker), network, install disk, kubelet, files, kernel args, sysctls, extensions
- cluster (cluster-wide): control plane endpoint, cluster name, API server, etcd, discovery, CNI, inline manifests
Generate configs: talosctl gen config <cluster-name> <endpoint> (via Bash)
Apply configs: talos_apply_config(config, mode) — requires the FULL machine configuration YAML. For partial changes, use talos_patch instead. Modes: auto (default), no-reboot, reboot, staged, try. Use insecure: true for nodes in maintenance mode.
Modify running configs with talos_patch(patch, node) — applies a strategic merge patch to the node's live config. Use $patch: delete to remove fields. Use dry_run: true to preview changes. See references/machine-config.md for full v1alpha1 structure.
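As an illustration of the patch format, the sketch below writes two strategic merge patches to disk: one that sets a sysctl and one that removes an interface entry with $patch: delete. The helper name, the sysctl value, the eth1 interface name, and the file names are placeholders, not values from a real cluster.

```shell
# write_example_patches DIR
# Writes two example strategic merge patches into DIR (all values are placeholders).
write_example_patches() {
  dir="$1"
  # Adds or overrides a single sysctl; the rest of the config is left untouched.
  cat > "$dir/sysctl-patch.yaml" <<'EOF'
machine:
  sysctls:
    net.core.somaxconn: "65535"
EOF
  # Removes the eth1 entry from machine.network.interfaces.
  cat > "$dir/delete-eth1-patch.yaml" <<'EOF'
machine:
  network:
    interfaces:
      - interface: eth1
        $patch: delete
EOF
}
```

The content of either file can then be passed as the patch argument of talos_patch, ideally with dry_run: true on the first attempt.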
1. talosctl gen secrets -o secrets.yaml (via Bash)
2. talosctl gen config <cluster-name> <endpoint> --with-secrets secrets.yaml (via Bash)
3. talos_apply_config(config, node, insecure=true): fresh nodes are in maintenance mode, requires insecure: true
4. talos_apply_config(config, node, insecure=true)
5. talos_bootstrap(node) on ONE CP node only
6. talos_kubeconfig: retrieve kubeconfig (needed for health check)
7. talos_health: verify cluster is up
Note: talos_version, talos_disks, talos_get, and talos_apply_config support insecure: true for nodes in maintenance mode. Stop using insecure once the machine config is applied.
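The two talosctl steps at the start of the bootstrap flow can be wrapped in one helper, sketched below. The function name gen_cluster_config is hypothetical; the commands and flags are the ones listed in those steps. The config step is skipped if secret generation fails, so a config is never generated against a missing secrets bundle.

```shell
# gen_cluster_config CLUSTER_NAME ENDPOINT
# Generates a secrets bundle, then machine configs that reuse it.
# Hypothetical helper; output files land in the current directory.
gen_cluster_config() {
  name="$1"
  endpoint="$2"
  talosctl gen secrets -o secrets.yaml || return 1
  talosctl gen config "$name" "$endpoint" --with-secrets secrets.yaml
}
```

Keeping secrets.yaml lets the same cluster identity be regenerated later, so it should be stored as carefully as the talosconfig itself.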
talos_upgrade is a single tool call that does the full talosctl-equivalent flow per node: cordon → drain → install → reboot → wait → uncordon. It returns when the node is fully back in service.
Internally, with auto_reboot=true (default):
1. Cordon and drain (k8s.io/kubectl/pkg/drain): same code path as kubectl drain and talosctl upgrade --drain=true. PDB-aware (retries blocked evictions), skips DaemonSet/mirror/static pods, evicts emptyDir-using pods (node is rebooting anyway), uses each pod's own terminationGracePeriodSeconds.
2. Install: ImageService.Pull + LifecycleService.Upgrade on v1.13+ servers, or single-shot MachineService.Upgrade on older ones.
3. Reboot: Reboot RPC on v1.13+ (LifecycleService is install-only), or implied by the legacy upgrade RPC on <v1.13.
4. Wait: polls c.Version with a saw-down-then-saw-up rule to detect the reboot, then polls the Kubernetes API for the node's Ready condition.
K8s steps are skipped gracefully if the target isn't a Kubernetes member (e.g. fresh node not yet joined); the upgrade still runs.
Steps:
1. talos_version, talos_health, list installed extensions (talos_extensions)
2. talos_etcd_snapshot (recommended before any CP upgrade)
3. For each node:
   a. talos_upgrade(node, image): does the full cycle, returns when the node is back and uncordoned.
   b. Verify (talos_health, talos_etcd_members) before moving on.
4. talos_health, talos_version, talos_extensions (catches the "upgraded to stock image, lost extensions" mistake).
Response shape (on auto_reboot=true success path):
- api: "lifecycle" (v1.13+) or "legacy" (<v1.13)
- server_tag: detected pre-upgrade Talos tag
- pulled_image: canonical digest-pinned image (v1.13+ only)
- rebooted: true, talos_back: true, k8s_ready: true, uncordoned: true
- k8s_node_name: resolved Kubernetes node name (only when the target is a K8s member)
- stages: ordered list of progress messages ("cordoning node X", "evicting pod Y/Z", "talos reachable again", etc.)
- status: "ok"
If something fails mid-flow, the response still includes everything observed (rebooted, talos_back, etc. as far as they got) plus an error field (wait_error, k8s_ready_error, uncordon_error) and a status like "rebooted_no_wait" or "installed_no_reboot". The uncordon is always attempted on a best-effort basis.
auto_reboot=false — install without activating:
Pass auto_reboot=false when you want to stage the install but defer activation (install during business hours, reboot during a maintenance window). This skips drain, reboot, wait, and uncordon — just the install runs. The new version is written to the alternate A/B partition with META updated; the node keeps running the current version until you trigger talos_reboot (or another talos_upgrade call with auto_reboot=true) manually.
Important upgrade rules:
- talos_upgrade first invokes ImageService.Pull into the system containerd, then LifecycleService.Upgrade with the resolved (digest-pinned) image. Response includes pulled_image with the canonical reference.
- For nodes with system extensions, run /talos-image first and pass that image to talos_upgrade.
- talos_rollback is for reverting a successful but unwanted upgrade.
- stage: true (legacy <v1.13 primitive) defers the install to next reboot. Prefer the cross-version auto_reboot: false for the same effect on either path.
Use talosctl upgrade-k8s --to <version> via Bash. This is a complex client-side orchestration that patches all nodes' configs, pre-pulls images, and monitors rollout. As of v1.13 this remains client-side: the new LifecycleService API covers Talos OS install/upgrade only, not the Kubernetes control-plane upgrade flow (which interleaves Talos API and Kubernetes API calls). Do NOT attempt to replicate this manually with talos_patch; use the talosctl command directly. Use --dry-run first to preview the plan. The command is resumable if interrupted.
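The dry-run-then-upgrade sequence can be sketched as a small wrapper. The function name upgrade_k8s is hypothetical, and the sketch assumes the command is pointed at a control plane node with --nodes. It only proceeds when the dry run exits successfully, which is a weaker check than a person reading the plan.

```shell
# upgrade_k8s NODE VERSION
# Previews the Kubernetes upgrade with --dry-run and starts the real upgrade
# only if the preview exits successfully. NODE is a control plane node.
upgrade_k8s() {
  node="$1"
  version="$2"
  talosctl --nodes "$node" upgrade-k8s --to "$version" --dry-run || return 1
  talosctl --nodes "$node" upgrade-k8s --to "$version"
}
```

In practice it is usually better to run the two commands separately and review the dry-run output in between; the wrapper mainly guards against skipping the preview.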
Generate worker config for the cluster, apply to new node. It joins automatically via discovery.
For workers: talos_reset(node), then kubectl delete node <name> via Bash. For CP nodes: talos_reset(node, graceful=true) handles etcd departure automatically, then kubectl delete node <name>. Manual etcd leave/remove (talos_etcd_forfeit_leadership, talos_etcd_leave) is only needed for non-graceful resets or edge cases.
talos_reset(node, graceful=true) — cordon/drain, leave etcd if CP node, wipe disks, power down. After reset, delete the Kubernetes node object: kubectl delete node <name> via Bash.
Use the local imager container to build custom Talos images. As of v1.13 the imager runs rootless, so --privileged and -v /dev:/dev are only needed for bootable-media profiles (iso, metal, disk-image, cloud) that use loop devices. The installer profile does not need them. Always bind-mount a host directory to /out for output.
```shell
mkdir -p _out
docker run --rm -t \
  -v "$PWD/_out:/out" \
  ghcr.io/siderolabs/imager:v1.13.2 \
  <output-type> \
  --system-extension-image ghcr.io/siderolabs/<extension>:<tag>
```
Extension tags must match the Talos version — never use :latest. Look up matching tags at https://github.com/siderolabs/extensions/releases.
Output types: iso, metal, disk-image, installer, aws, azure, gcp, etc.
See references/boot-assets.md for extension list, overlay options, SecureBoot, and profiles.
Each node runs two containerd instances with separate image stores:
- cri (the default): kubelet's containerd, namespace k8s.io. Holds all Kubernetes workload images. Old images accumulate here because kubelet's GC is lazy (high/low disk watermarks). This is the common prune target.
- system: Talos's own containerd. Holds the installer image used during upgrades and system-extension images. Smaller, rarely needs pruning. Don't remove the installer image for the currently running version; talos_rollback needs it.
Tools (all single-node-per-call):
- talos_image_list(node, namespace?): list cached images with name, digest, size.
- talos_image_remove(node, image, namespace?): remove one image by ref. Equivalent to talosctl image remove.
- talos_image_prune(node, namespace?, dry_run=true): list all images not in use by any running container; remove them when dry_run=false. Plugin-level helper (no talosctl equivalent). Always preview with the default dry-run before re-running with dry_run=false.
Ref shapes in the image store: containerd indexes each cached image under three separate refs: the tag (docker.io/foo/bar:v1), the digest (docker.io/foo/bar@sha256:...), and the raw content ID (sha256:...). All three point at the same on-disk blob, so removing only the tag with talos_image_remove does not reclaim bytes, because the digest and content-ID refs still pin the blob. To actually free space for one image, remove all three forms. talos_image_prune already does this automatically because it iterates the full candidate list, which is why prune is the right tool for "reclaim disk space" and individual remove is the right tool for "force a re-pull of a specific tag".
Extensions add functionality (drivers, tools, services) to Talos. They are baked into the boot image — not installed at runtime.
Three tiers: core (official, tested), extra (community, tested), contrib (community, best-effort).
Common extensions: iscsi-tools, qemu-guest-agent, intel-ucode, amd-ucode, nvidia-container-toolkit, tailscale, drbd.
Check installed: talos_extensions
Talos networking is configured in machine.network. Key concepts:
- Interface selection: deviceSelector (preferred) or name
- Addressing: static (addresses + routes) or DHCP
- Bonds, bridges, VLANs: bond, bridge, vlans config
- Virtual IP (vip.ip)
- WireGuard: wireguard interface config
- KubeSpan: KubeSpanConfig document as of v1.13 (.machine.network.kubespan is deprecated but still works); excludeAdvertisedNetworks filters which CIDRs are advertised
- Firewall: networkRuleConfig resources
See references/networking.md for configuration patterns.
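A minimal sketch of how several of these concepts combine, written as a patch file. It assumes the machine.network.interfaces layout; the helper name, MAC address, IP addresses, gateway, VIP, and bond member names are all placeholders.

```shell
# write_network_patch FILE
# Writes an example machine.network patch (every address and name is a placeholder).
write_network_patch() {
  cat > "$1" <<'EOF'
machine:
  network:
    interfaces:
      # Static addressing plus a shared VIP, with the NIC selected by MAC address.
      - deviceSelector:
          hardwareAddr: "00:00:5e:00:53:01"
        addresses:
          - 192.168.0.10/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.0.1
        vip:
          ip: 192.168.0.15
      # LACP bond over two other NICs, addressed via DHCP.
      - interface: bond0
        dhcp: true
        bond:
          mode: 802.3ad
          lacpRate: fast
          interfaces:
            - eth1
            - eth2
EOF
}
```

Selecting the first NIC by hardware address rather than by a broad selector keeps it from also matching the bond members.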
- talos_etcd_members
- talos_etcd_status
- talos_etcd_snapshot(output_path): always do before upgrades/resets
- talos_etcd_defrag: run on one node at a time, resource-heavy
- talos_etcd_remove_member(member_id): required before resetting a CP node
- talos_etcd_forfeit_leadership: before maintenance on leader node
- talos_etcd_leave: graceful removal
- talos_etcd_alarm: check for NOSPACE or other alarms
- talosctl bootstrap --recover-from=<snapshot-path> (via Bash)
- Certificates: trustd, auto-rotated
- talosctl gen secrets (CLI required) → apply new config
- Roles: os:admin, os:reader, os:etcd:backup, os:impersonator
When diagnosing issues, follow this order:
1. talos_health: overall cluster health
2. talos_services: check service states
3. talos_logs(service, filter): service-specific logs, use filter to search for specific text
4. talos_dmesg(filter): kernel logs, use filter to search for specific drivers/errors
5. talos_etcd_members + talos_etcd_status + talos_etcd_alarm: etcd health
6. talos_memory, talos_cpu, talos_disk_usage: resource pressure
7. talos_interfaces, talos_addresses, talos_routes: network state
8. talos_volumes, talos_discovered_volumes: storage state
9. talos_time: NTP sync status
10. talos_read, talos_ls: inspect files on node
See references/troubleshooting.md for common issues and solutions.
- talos_reset(node, graceful=false, reboot=true, system_labels_to_wipe="EPHEMERAL")
- talosctl bootstrap --recover-from=<snapshot> (via Bash)
- If the snapshot was copied from disk (not taken with talos_etcd_snapshot), add --recover-skip-hash-check
- talosctl cp /var/lib/etcd/member/snap/db . via Bash
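The choice between the two bootstrap variants can be sketched as a helper. The function name recover_etcd and its from-disk argument are hypothetical; the talosctl flags are the ones listed in the recovery steps.

```shell
# recover_etcd NODE SNAPSHOT [from-disk]
# Bootstraps etcd on NODE from SNAPSHOT. Pass "from-disk" as the third
# argument when SNAPSHOT is a raw copy of /var/lib/etcd/member/snap/db,
# which carries no integrity hash and therefore needs the hash check skipped.
recover_etcd() {
  node="$1"
  snapshot="$2"
  if [ "${3:-}" = "from-disk" ]; then
    talosctl --nodes "$node" bootstrap --recover-from="$snapshot" --recover-skip-hash-check
  else
    talosctl --nodes "$node" bootstrap --recover-from="$snapshot"
  fi
}
```

Making the from-disk case an explicit opt-in keeps the hash check enabled for ordinary snapshots.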