Lyon's DevOps/SRE engineering guidelines for production infrastructure. Covers Terraform IaC patterns (multi-account AWS, S3 backend, symlink module architecture), AWS EKS operations (Karpenter Helm chart patterns, service-to-NodePool mapping, cluster upgrades), Grafana LGTM observability stack (Loki distributed, Grafana, Tempo, Mimir, OpenTelemetry auto-instrumentation), Cilium eBPF networking (kube-proxy replacement, Tetragon runtime security, Hubble observability), ArgoCD GitOps (multi-env Application templates, AppProject isolation), Datadog integration, and infrastructure decision-making. Use when writing Terraform HCL, managing EKS clusters, configuring Cilium/Tetragon, designing observability pipelines, setting up ArgoCD GitOps, reviewing infrastructure PRs, or making DevOps architectural decisions.
Slash command: /lyon:lyon
Personal engineering guidelines for production infrastructure management. Optimized for AWS-centric, Kubernetes-native, observability-first environments.
- references/terraform-patterns.md - Terraform IaC patterns and conventions
- references/eks-operations.md - EKS cluster management, Karpenter Helm chart patterns
- references/observability.md - Grafana LGTM stack, Loki distributed, OTel, Datadog
- references/cilium-network.md - Cilium eBPF networking, Tetragon runtime security
- references/gitops.md - ArgoCD GitOps patterns, multi-env Application templates

Directory naming: terraform/aws/{ACCOUNT}/{ENV|general}/{SERVICE}/
terraform/
  aws/
    {account-a}/              # Account A
      {env-1}/eks/
      {env-2}/eks/
      {env-N}/eks/
      general/                # Cross-environment resources
        datadog/
        cloudwatch-alarm/
        chatbot-slack/
    {account-b}/              # Account B
      {env-1}/eks/
      {env-N}/eks/
      general/
        datadog/
  modules/aws/                # Reusable modules
    eks/
    chatbot-slack/
    cloudwatch-alarm/
    datadog/
State management: Always use S3 backend with DynamoDB locking.
terraform {
  backend "s3" {
    bucket         = "{company}-terraform-state"
    key            = "aws/{account}/{env}/eks/terraform.tfstate"
    region         = "ap-northeast-2"
    dynamodb_table = "{company}-terraform-lock"
    encrypt        = true
  }
}
Symlink architecture: For multi-environment EKS deployments, use symlinks from environment directories to common module files. This maintains per-environment state while sharing the same Terraform code.
# Environment-specific: terraform.tf, variables.tf, terraform.tfvars
# Symlinked from module: vpc.tf, eks.tf, karpenter.tf
ln -sf ../../../modules/aws/eks/vpc.tf .
ln -sf ../../../modules/aws/eks/eks.tf .
ln -sf ../../../modules/aws/eks/karpenter.tf .
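
As a rough illustration of what the environment-specific files (terraform.tf, variables.tf) might contain, here is a minimal sketch that combines a `~>` provider constraint, a `validation {}` block, and a `locals {}` block. The variable names, the allowed environment values, the provider version, and the `platform-` naming prefix are all assumptions made for the example, not part of the original guidelines.

```hcl
# terraform.tf (sketch): provider pinned with a pessimistic constraint.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumption: allows any 5.x release, never 6.0
    }
  }
}

# variables.tf (sketch): user-facing inputs carry validation blocks.
variable "environment" {
  description = "Deployment environment for this EKS stack."
  type        = string

  validation {
    # Hypothetical environment names; replace with the real ones.
    condition     = contains(["dev", "stg", "prod"], var.environment)
    error_message = "environment must be one of: dev, stg, prod."
  }
}

variable "cluster_version" {
  description = "Kubernetes minor version for the EKS control plane."
  type        = string
  default     = "1.33"

  validation {
    condition     = can(regex("^1\\.[0-9]+$", var.cluster_version))
    error_message = "cluster_version must look like \"1.33\"."
  }
}

# Computed values live in locals; nothing here is meant to be overridden.
locals {
  cluster_name = "platform-${var.environment}-eks" # hypothetical naming scheme
}
```

With this split, terraform.tfvars only needs to set `environment` (and optionally `cluster_version`), while the symlinked module files can read `local.cluster_name`.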
- `validation {}` blocks for user-facing variables
- `~>` for minor version flexibility
- `locals {}` for computed values, `variable {}` for configurable ones

| Property | Standard |
|---|---|
| Kubernetes version | Latest stable (currently 1.33) |
| Node management | Karpenter v1 (preferred) + Managed Node Groups (system) |
| CNI | VPC CNI + Cilium (eBPF chaining) |
| Region | ap-northeast-2 |
| Auth | aws-auth ConfigMap -> EKS Access Entries migration |
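# NodePool: prefer spot, fall back to on-demand
# Use a consolidation policy for cost optimization
# Set disruption budgets to prevent mass eviction
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:              # assumes an EC2NodeClass named "default" exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter prioritizes spot when both are allowed
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m         # illustrative value, tune per workload
    budgets:
      - nodes: "10%"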
| Feature | Configuration |
|---|---|
| kube-proxy replacement | kubeProxyReplacement: true |
| CNI mode | AWS VPC CNI chaining (native routing, no overlay) |
| Hubble | DNS, drops, flows, TCP, HTTP metrics via eBPF |
| Tetragon | Runtime security — process exec, file access, privilege escalation detection |
| L7 proxy | Disabled (use Datadog/Tempo for L7 observability) |
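
The table above could translate into Helm values roughly as follows. This is a sketch only: it assumes Cilium and Tetragon are installed through Terraform's `helm_release` resource (the original does not say how they are deployed), and the `cluster_endpoint_host` variable is a hypothetical name. The chaining-related values follow the settings Cilium documents for AWS VPC CNI chaining; chart versions are deliberately left unpinned here and should be pinned in real use.

```hcl
variable "cluster_endpoint_host" {
  description = "EKS API server hostname without the https:// prefix (hypothetical variable)."
  type        = string
}

resource "helm_release" "cilium" {
  name       = "cilium"
  repository = "https://helm.cilium.io/"
  chart      = "cilium"
  namespace  = "kube-system"

  values = [yamlencode({
    # kube-proxy replacement needs a direct path to the API server.
    kubeProxyReplacement = true
    k8sServiceHost       = var.cluster_endpoint_host
    k8sServicePort       = 443

    # AWS VPC CNI chaining: native routing, no overlay.
    cni = {
      chainingMode = "aws-cni"
      exclusive    = false
    }
    enableIPv4Masquerade = false
    routingMode          = "native"
    endpointRoutes       = { enabled = true }

    # L7 proxy disabled, per the table above.
    l7Proxy = false

    hubble = {
      enabled = true
      relay   = { enabled = true }
      metrics = {
        # Metric set taken from the Hubble row of the table.
        enabled = ["dns", "drop", "flow", "tcp", "http"]
      }
    }
  })]
}

# Tetragon ships as a separate chart in the same repository.
resource "helm_release" "tetragon" {
  name       = "tetragon"
  repository = "https://helm.cilium.io/"
  chart      = "tetragon"
  namespace  = "kube-system"
}
```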
- `kubectl apply` on upper environments
- `{service}-values.yaml`

| Component | Purpose | Priority |
|---|---|---|
| Grafana | Visualization & dashboards | Core |
| Loki | Log aggregation (LogQL) | Core |
| Tempo | Distributed tracing | High |
| Mimir | Long-term metrics storage | High |
| Alloy/OTel | Collection & routing | Core |
Design principle: Instrument once with OpenTelemetry, route to multiple backends.
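
One way to picture "instrument once, route to multiple backends" is a single OpenTelemetry Collector configuration with one OTLP receiver and one exporter per backend. The sketch below builds such a config with `yamlencode` so it can be passed to whatever deploys the collector. The service hostnames and the `observability` namespace are hypothetical, and the Loki exporter assumes a Loki version that accepts OTLP on its `/otlp` endpoint.

```hcl
locals {
  otel_collector_config = yamlencode({
    # Applications send OTLP once; the collector handles fan-out.
    receivers = {
      otlp = {
        protocols = {
          grpc = { endpoint = "0.0.0.0:4317" }
          http = { endpoint = "0.0.0.0:4318" }
        }
      }
    }

    processors = {
      batch = {}
    }

    # One exporter per backend; all endpoints below are hypothetical.
    exporters = {
      "otlp/tempo" = {
        endpoint = "tempo-distributor.observability.svc:4317"
        tls      = { insecure = true }
      }
      "prometheusremotewrite/mimir" = {
        endpoint = "http://mimir-nginx.observability.svc/api/v1/push"
      }
      "otlphttp/loki" = {
        endpoint = "http://loki-gateway.observability.svc/otlp"
      }
    }

    service = {
      pipelines = {
        traces  = { receivers = ["otlp"], processors = ["batch"], exporters = ["otlp/tempo"] }
        metrics = { receivers = ["otlp"], processors = ["batch"], exporters = ["prometheusremotewrite/mimir"] }
        logs    = { receivers = ["otlp"], processors = ["batch"], exporters = ["otlphttp/loki"] }
      }
    }
  })
}

# Exposed so a ConfigMap or Helm values block can consume the rendered YAML.
output "otel_collector_config" {
  value = local.otel_collector_config
}
```

Adding another backend, such as a commercial APM exporter, would then mean appending one exporter and listing it in the relevant pipelines, without touching application instrumentation.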
For teams requiring commercial APM alongside open-source observability:
- `env`, `service`, `team`

Need monitoring?
-> Metrics: Prometheus/Mimir + Grafana dashboards
-> Logs: Loki + Grafana Explore
-> Traces: Tempo + Grafana Traces
-> Network: Cilium Hubble + Grafana (DNS, flows, drops)
-> Security: Tetragon + Grafana (process exec, file access)
-> APM (commercial): Datadog
-> AWS-native alerts: CloudWatch Alarm + Chatbot -> Slack
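
The last branch of the tree (CloudWatch Alarm + Chatbot -> Slack) might be wired up along these lines. This is a sketch under several assumptions: Container Insights is enabled so the `ContainerInsights` metrics exist, the threshold and periods are illustrative, all variable and resource names are hypothetical, and the AWS provider in use is recent enough to include `aws_chatbot_slack_channel_configuration`.

```hcl
variable "cluster_name" { type = string }     # hypothetical inputs
variable "chatbot_role_arn" { type = string }
variable "slack_team_id" { type = string }
variable "slack_channel_id" { type = string }

# Alarms publish to SNS; Chatbot forwards the topic to Slack.
resource "aws_sns_topic" "alerts" {
  name = "eks-alerts"
}

resource "aws_cloudwatch_metric_alarm" "node_cpu_high" {
  alarm_name          = "${var.cluster_name}-node-cpu-high"
  namespace           = "ContainerInsights" # assumes Container Insights is enabled
  metric_name         = "node_cpu_utilization"
  dimensions          = { ClusterName = var.cluster_name }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 80 # illustrative threshold, in percent
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
  ok_actions          = [aws_sns_topic.alerts.arn]
}

resource "aws_chatbot_slack_channel_configuration" "alerts" {
  configuration_name = "eks-alerts"
  iam_role_arn       = var.chatbot_role_arn
  slack_team_id      = var.slack_team_id
  slack_channel_id   = var.slack_channel_id
  sns_topic_arns     = [aws_sns_topic.alerts.arn]
}
```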
When reviewing infrastructure PRs:
- `Name`, `Environment`, `Team`, `ManagedBy: terraform`
- `depends_on` (let Terraform infer)

| Criterion | Weight |
|---|---|
| Open-source & community-driven | High |
| Kubernetes-native | High |
| Terraform provider available | Medium |
| Active maintenance & CNCF/Grafana backing | Medium |
| Team familiarity | Medium |
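
For the tag set named in the PR review checklist (Name, Environment, Team, ManagedBy: terraform), one option is to let the AWS provider apply the shared tags through `default_tags` and set only `Name` per resource. The following is a minimal sketch; the variable names and the example SNS topic are hypothetical.

```hcl
variable "environment" { type = string } # hypothetical inputs
variable "team" { type = string }

provider "aws" {
  region = "ap-northeast-2"

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Environment = var.environment
      Team        = var.team
      ManagedBy   = "terraform"
    }
  }
}

# Name stays per-resource; the other three tags are merged in automatically.
resource "aws_sns_topic" "example" {
  name = "example-${var.environment}"

  tags = {
    Name = "example-${var.environment}"
  }
}
```

A reviewer then only has to confirm that each new resource sets `Name`, since the remaining tags cannot be forgotten.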
- `terraform plan` before any apply, check state drift
npx claudepluginhub yieon-lyon/lyon-skills --plugin lyon