From platform-design
Design the Kubernetes cluster topology — cluster placement per coordinate, node pool strategy, multi-tenancy model, and ResourceQuota tier templates — from the Platform Coordinate System. Produces a compute-design.md document used as input for manage-k8s-namespaces and cluster provisioning IaC modules. Use after design-segmentation, design-networking, and define-naming-convention.
Slash command
/platform-design:design-compute
Produce a compute design document — a cloud-agnostic specification of the Kubernetes cluster topology, node pool strategy, multi-tenancy model, and quota template definitions. This document is the input for manage-k8s-namespaces and guides the Terraform/Pulumi IaC modules used for cluster provisioning.
When writing configurations or documentation, you MUST strictly adhere to the structural notation and types defined in the book. Before proceeding, read the following reference files:
- references/notation.md
- references/types.md

Before proceeding, ask the user (or infer from context):

- design-segmentation: Sectors, Tiers, Regions, and Tenants. Cluster placement maps to coordinates.
- design-networking: which spoke VPCs/VNets exist. Clusters live inside spokes.

Map clusters to coordinates. A cluster typically corresponds to one (Sector, Tier, Region) coordinate: it lives within that spoke's network and inherits the tier's security posture.
One cluster per (Sector, Tier, Region) (pool model — recommended baseline):
One cluster per (Sector, Tier, Region, Tenant) (silo model — high isolation):
Bridge model (most common in practice):
For each cluster, specify:
- k8s-ecommerce-live-eu01)
- (Sector, Tier, Region) or (Sector, Tier, Region, Tenant) for silo clusters

Every cluster needs at least one general-purpose node pool. Define the node pool strategy:
General-purpose pool (required):
- m5.xlarge, Azure Standard_D4s_v5, GCP e2-standard-4)

Specialized pools (define only if a workload type requires them):

- dedicated=gpu:NoSchedule) and labels so workloads explicitly request them.
- dedicated={tenant}:NoSchedule taint.

For each pool, specify:
For pool clusters (shared by multiple tenants), define the namespace-as-a-service model:
Namespace per tenant:
- manage-k8s-iam)
- mountainlab.io/tenant, mountainlab.io/tier, etc.)

Virtual clusters (vcluster):
Document the model chosen and which (if any) tenants use virtual clusters.
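For the namespace-per-tenant model, a tenant namespace carrying the labels mentioned above might look like the following. This is a minimal sketch: the namespace name and the label values (`checkout`, `live`) are hypothetical, and only the two label keys come from this document.

```yaml
# Sketch of a tenant namespace in a pool cluster.
# The name and label values are illustrative assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: checkout-live                  # hypothetical tenant namespace
  labels:
    mountainlab.io/tenant: checkout    # hypothetical tenant identifier
    mountainlab.io/tier: live          # tier taken from the cluster coordinate
```

Consistent labels let NetworkPolicy selectors and cost reports target a tenant without relying on namespace naming alone.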
ResourceQuotas prevent a single tenant from exhausting cluster resources. Define a set of templates that tenants choose from, rather than configuring quotas per tenant (which creates inconsistency).
Recommended templates (adjust numbers based on node sizes):
| Template | CPU Requests | CPU Limits | Memory Requests | Memory Limits | Persistent Storage | Use Case |
|---|---|---|---|---|---|---|
| xsmall | 2 | 4 | 4 Gi | 8 Gi | 10 Gi | Single-service or very small team |
| small | 8 | 16 | 16 Gi | 32 Gi | 50 Gi | Small microservice team |
| medium | 16 | 32 | 32 Gi | 64 Gi | 100 Gi | Standard product team |
| large | 32 | 64 | 64 Gi | 128 Gi | 500 Gi | Large team or data-intensive service |
| xlarge | 64 | 128 | 128 Gi | 256 Gi | 1 Ti | Platform or specialized high-load service |
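As an illustration, the `small` row could be expressed as a ResourceQuota roughly as follows. This is a sketch, not a prescribed manifest: the object name and namespace are hypothetical, and the mapping of "Persistent Storage" to `requests.storage` (the sum of PersistentVolumeClaim requests) is an assumption about what the column means.

```yaml
# Sketch: the "small" template as a ResourceQuota.
# metadata values are illustrative assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-small            # hypothetical object name
  namespace: checkout-live     # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "8"          # CPU Requests
    limits.cpu: "16"           # CPU Limits
    requests.memory: 16Gi      # Memory Requests
    limits.memory: 32Gi        # Memory Limits
    requests.storage: 50Gi     # total storage requested across all PVCs
```

The other templates differ only in the values under `spec.hard`, which is what makes a fixed set of tiers easy to stamp out per namespace.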
LimitRange defaults (injected into every namespace to prevent unbounded containers):
defaultRequest:
  cpu: "100m"
  memory: "128Mi"
default:
  cpu: "500m"
  memory: "512Mi"
These defaults ensure that containers that omit resource requests or limits still receive bounded values instead of running unbounded.
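Placed in a complete manifest, the defaults above could look like this. It is a minimal sketch: the object name and namespace are hypothetical, and the values are applied at `Container` scope.

```yaml
# Sketch: LimitRange carrying the default requests and limits.
# metadata values are illustrative assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults     # hypothetical object name
  namespace: checkout-live     # hypothetical tenant namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

When a namespace also has a ResourceQuota on CPU or memory, pods that leave those values unset are rejected unless a LimitRange supplies them, so the two objects are usually deployed together.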
Quota increase process: Define how tenants request a quota increase. Recommended: a pull request to the tenants repository changing the quota tier. The PR triggers a platform review conversation (is the team running out of capacity, or is there a scaling bug?). Friction here is intentional — it surfaces architecture conversations.
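To make the pull-request flow concrete: assuming the tenants repository keeps one YAML file per tenant, a quota increase could reduce to a one-line change. The file path and every field name below are hypothetical, since the repository layout is not defined here.

```yaml
# tenants/checkout.yaml (hypothetical path and schema, for illustration only)
tenant: checkout
namespace: checkout-live
quotaTemplate: medium    # changed from "small" in this pull request
```

A diff this small keeps the review focused on why the tenant needs more capacity.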
The cluster itself requires resources for platform components (metrics server, logging agents, network policy controller, ingress controller, cert-manager, ExternalDNS). These should run on a dedicated platform node pool to avoid competing with tenant workloads for resources.
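A platform component could be pinned to such a pool roughly as sketched below, assuming the pool is tainted `dedicated=platform:NoSchedule` and its nodes also carry a `dedicated: platform` label. The label, the workload name, the namespace, and the image are hypothetical.

```yaml
# Sketch: a platform Deployment scheduled onto the platform node pool.
# Names, namespace, node label, and image are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-platform-agent
  namespace: platform-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-platform-agent
  template:
    metadata:
      labels:
        app: example-platform-agent
    spec:
      nodeSelector:
        dedicated: platform          # assumed node label on the platform pool
      tolerations:
        - key: dedicated
          operator: Equal
          value: platform
          effect: NoSchedule         # matches the pool taint
      containers:
        - name: agent
          image: registry.example.com/platform/agent:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

The toleration only permits scheduling onto tainted nodes. The node selector is what keeps the component off the general-purpose pool.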
- dedicated=platform:NoSchedule. Platform components tolerate this taint; tenant workloads do not.

For each cluster, define how external traffic reaches workloads:

- Gateway resource; tenants manage HTTPRoute.

Present the cluster topology, node pool design, and quota templates. Ask:
Iterate until satisfied.
Produce a Markdown document named compute-design.md:
# Compute Design
## Cluster Topology
| Cluster Name | Coordinate | Provider | Network Spoke | Multi-Tenancy Model |
|-------------|-----------|----------|--------------|---------------------|
| k8s-ecommerce-live-eu01 | ("ecommerce", "live", "eu01", _) | AKS | vnet-ecommerce-live-eu01 | Pool (all tenants) |
| k8s-payments-live-eu01 | ("ecommerce", "live", "eu01", "payments") | AKS | vnet-ecommerce-live-eu01 | Silo (PCI scope) |
## Node Pools
### Cluster: k8s-ecommerce-live-eu01
| Pool Name | Purpose | Instance Type | Min/Max | Spot? | Taints |
|-----------|---------|--------------|---------|-------|--------|
| general | Default tenant workloads | Standard_D4s_v5 | 3/20 | No | — |
| platform | Platform components | Standard_D2s_v5 | 2/3 | No | dedicated=platform:NoSchedule |
## Multi-Tenancy Model
[Pool / Silo / Bridge description with which tenants get dedicated clusters]
## Namespace Configuration
[Pre-configured components: ResourceQuota, LimitRange, NetworkPolicy, RBAC, Labels]
## Quota Templates
| Template | CPU Req | CPU Limit | Mem Req | Mem Limit | Storage |
|----------|---------|-----------|---------|-----------|---------|
| xsmall | 2 | 4 | 4 Gi | 8 Gi | 10 Gi |
| small | 8 | 16 | 16 Gi | 32 Gi | 50 Gi |
| medium | 16 | 32 | 32 Gi | 64 Gi | 100 Gi |
| large | 32 | 64 | 64 Gi | 128 Gi | 500 Gi |
| xlarge | 64 | 128 | 128 Gi | 256 Gi | 1 Ti |
### LimitRange Defaults
[Default CPU/memory requests and limits injected into every namespace]
### Quota Increase Process
[PR-based process and review policy]
## Edge Routing
[Ingress controller, ExternalDNS, cert-manager, Gateway API note]
## Assumptions and Open Questions
[...]
- Ingress or HTTPRoute resources.

This skill is grounded in Chapter 6: Infrastructure of Crafting Platforms.
npx claudepluginhub craftingplatforms/ai --plugin platform-design