capacity-planning | argos

Capacity Planning

Shared Doctrine

agents/shared/severity-rubric.md and agents/shared/escalation-matrix.md are treated as default-loaded (agents/coordination.md §11). This skill's output must use the Critical / High / Medium / Low + evidence format; speculative Critical findings are forbidden. Findings outside this skill's ownership are delegated to the relevant agent; if the decision-authority threshold is exceeded, user approval is required.

Philosophy

Do not estimate without measuring. "This should be enough" is not acceptable; baseline + load test is the evidence.
Plan headroom: 30-50% above peak traffic. Full utilization = incident.
Find the bottleneck; do not just throw resources at it. If CPU is maxed out, a bigger instance ≠ solution (the cause may be an algorithm, an N+1 pattern, or a bad query).
Cost projection: capacity = $$. Calculate the ROI.
Forecast: weekly, seasonal, campaign-based.

When to Use

A new service is going to prod (capacity gate)
Planned traffic increase (Black Friday / campaign / new customer)
Re-baselining after a performance regression
Cost optimization (right-sizing, idle waste)
Autoscaling threshold tuning (HPA/VPA/Cluster Autoscaler)
Capacity assertion (guaranteeing "X concurrent users" for an SLA)
Before an architecture review (answering "what can it handle right now?")

Workflow

1) Measure the baseline

What can the current system handle? Metrics to collect before testing:

Metric	Tool	Window
RPS (peak / avg)	Prometheus `rate(http_requests_total[5m])`	7 days
p50/p95/p99 latency	Prometheus histogram	7 days
Error rate	Prometheus 5xx ratio	7 days
CPU/memory utilization	Prometheus `node_`, `container_` metrics	7 days
DB QPS + connection pool	postgres_exporter	7 days
Cache hit rate	redis_exporter / app metric	7 days
Concurrent users	unique sessions per 5-minute window	7 days

Find saturation:

CPU >70% sustained = saturated.
Memory >80% plus GC pressure.
DB connection pool full and `idle in transaction` sessions increasing.
Cache hit rate <85% (the target varies; compare against the baseline).
Network egress >70% of NIC capacity.
Disk I/O queue depth >10.
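
The saturation list above can be turned into a mechanical check. The following is a minimal sketch, not a definitive implementation: the field names (`cpuPct`, `gcPressure`, `idleInTxRising`, and so on) are hypothetical and would need to be mapped to whatever your metrics pipeline exports. The thresholds are the ones listed above.

```javascript
// Thresholds come from the list above; all field names are assumptions.
const SATURATION_RULES = [
  { name: 'cpu', test: (m) => m.cpuPct > 70 },
  { name: 'memory', test: (m) => m.memoryPct > 80 && m.gcPressure },
  { name: 'db_pool', test: (m) => m.dbPoolInUse >= m.dbPoolSize && m.idleInTxRising },
  { name: 'cache', test: (m) => m.cacheHitPct < 85 },
  { name: 'network', test: (m) => m.egressPctOfNic > 70 },
  { name: 'disk', test: (m) => m.diskQueueDepth > 10 },
];

// Returns the names of every resource whose rule fires for this sample.
function findSaturation(metrics) {
  return SATURATION_RULES.filter((rule) => rule.test(metrics)).map((rule) => rule.name);
}

// Illustrative sample of sustained values over the baseline window.
const sample = {
  cpuPct: 71, memoryPct: 62, gcPressure: false,
  dbPoolInUse: 60, dbPoolSize: 100, idleInTxRising: false,
  cacheHitPct: 91, egressPctOfNic: 35, diskQueueDepth: 2,
};
console.log(findSaturation(sample)); // only the CPU rule fires for this sample
```

Keeping the rules in a table makes it easy to replace a fixed threshold (for example the cache hit rate) with a value derived from your own baseline.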

2) Load test design

Tool selection

Tool	Suitable for
k6	HTTP/HTTPS, gRPC, WebSocket; JS scripting; Grafana integration. Modern default.
Locust	Python; UI; distributed runner; good for complex scenarios.
Gatling	Scala; high throughput; strong reporting.
JMeter	Java; legacy; GUI-driven design; complex SOAP/JMS.
wrk	C; micro-benchmarks, single endpoint.
Artillery	Node.js; YAML scenarios; multi-protocol.

Default choice: k6 (HTTP+WS+gRPC, modern, code-first).

Test profiles

// k6 example
import http from 'k6/http';
import { check, sleep } from 'k6';

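export const options = {
  // Note: k6 starts all scenarios in parallel by default. Run one profile per
  // test run, or stagger them with each scenario's `startTime` option.
  scenarios: {
    // 1. Smoke: small load, is the system up?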
    smoke: {
      executor: 'constant-vus',
      vus: 5, duration: '2m',
      tags: { test_type: 'smoke' },
    },
    // 2. Load: expected traffic
    load: {
      executor: 'ramping-vus',
      stages: [
        { duration: '5m', target: 100 },
        { duration: '20m', target: 100 },
        { duration: '5m', target: 0 },
      ],
      tags: { test_type: 'load' },
    },
    // 3. Stress: find the capacity limit
    stress: {
      executor: 'ramping-vus',
      stages: [
        { duration: '5m', target: 100 },
        { duration: '5m', target: 200 },
        { duration: '5m', target: 400 },
        { duration: '5m', target: 800 },
        { duration: '10m', target: 1500 },
        { duration: '5m', target: 0 },
      ],
      tags: { test_type: 'stress' },
    },
    // 4. Spike: sudden increase
    spike: {
      executor: 'ramping-vus',
      stages: [
        { duration: '10s', target: 50 },
        { duration: '30s', target: 1000 },   // sudden spike
        { duration: '3m', target: 50 },
        { duration: '10s', target: 0 },
      ],
      tags: { test_type: 'spike' },
    },
    // 5. Soak: long-running (memory leak / resource leak)
    soak: {
      executor: 'constant-vus',
      vus: 100, duration: '4h',
      tags: { test_type: 'soak' },
    },
  },
  thresholds: {
    'http_req_duration{test_type:load}': ['p(99)<500'],
    'http_req_failed': ['rate<0.001'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
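
Because each VU in the script above sends one request and then sleeps for one second, the VU count does not equal RPS. A rough estimate for a closed-loop test like this is `RPS ≈ VUs / (response time + think time)`. The sketch below is plain JavaScript rather than a k6 script, the function names are hypothetical, and it assumes one request per iteration with a stable average response time.

```javascript
// Approximate throughput of a closed-loop test: every VU repeats
// "request, then think" and so completes 1 / (response + think) iterations/s.
function estimateRps(vus, avgResponseSec, thinkSec) {
  return vus / (avgResponseSec + thinkSec);
}

// Inverse: VUs needed to drive a target RPS, rounded up.
function vusForTargetRps(targetRps, avgResponseSec, thinkSec) {
  return Math.ceil(targetRps * (avgResponseSec + thinkSec));
}

// 100 VUs, 250 ms average response, sleep(1) between iterations.
console.log(estimateRps(100, 0.25, 1)); // 80
// VUs needed to reach 1190 RPS under the same assumptions.
console.log(vusForTargetRps(1190, 0.25, 1)); // 1488
```

As latency rises under load, the same number of VUs produces fewer requests per second, so read the measured request rate from the test output rather than inferring it from the VU target.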

Scenario realism

Realistic mix: 70% GET, 20% POST, 10% PUT/DELETE (varies by service).
Think time: 1-3 s (real user pauses).
Data variety: faker/CSV; each VU uses different data.
Auth: cache the session token (do not log in on every request).
External dependencies: mock vs. real (cost vs. realism).
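
The request mix and think time above can be driven by two small helpers. This is a sketch under stated assumptions: the even 5% / 5% split between PUT and DELETE is a guess (the text only gives 10% combined), and the helper names are hypothetical. Both helpers take the random number as an argument so they stay deterministic under test; in a load script you would pass `Math.random()`.

```javascript
// Weights follow the mix above; the PUT/DELETE split is an assumption.
const MIX = [
  { method: 'GET', weight: 70 },
  { method: 'POST', weight: 20 },
  { method: 'PUT', weight: 5 },
  { method: 'DELETE', weight: 5 },
];

// Pick a method for a random number r in [0, 1) by walking the cumulative weights.
function pickMethod(r, mix = MIX) {
  const total = mix.reduce((sum, entry) => sum + entry.weight, 0);
  let cursor = r * total;
  for (const entry of mix) {
    if (cursor < entry.weight) return entry.method;
    cursor -= entry.weight;
  }
  return mix[mix.length - 1].method; // guard against rounding near r = 1
}

// Think time in seconds, uniform in [min, max).
function thinkTime(r, min = 1, max = 3) {
  return min + r * (max - min);
}

console.log(pickMethod(Math.random()), thinkTime(Math.random()).toFixed(2));
```

Inside a k6 iteration the result of `thinkTime(Math.random())` would be passed to `sleep`, and the chosen method would select which request to send.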

3) Test environment

Production-like (resources, replicas, network).
Separate namespace or cluster; nothing leaks into prod.
Test data: anonymized prod snapshot or synthetic.
Network: same region (for a latency baseline).
Idempotent: state is clean after the test.

4) Analysis

Bottleneck triage tree

RPS limit reached, latency rising
   ↓
CPU >80%? ─── yes ──→ saturated; profile (pprof, py-spy) → optimize algorithm / scale out
        │
        no
        ↓
Memory >85%? ─── yes ──→ leak or buffer; heap snapshot
              │
              no
              ↓
DB pool full? ─── yes ──→ pool size, slow query (EXPLAIN), N+1, index
              │
              no
              ↓
Cache miss high? ─── yes ──→ TTL, key strategy, warm-up
                  │
                  no
                  ↓
Network egress full? ─── yes ──→ NIC limit, payload size, compression
                       │
                       no
                       ↓
Lock contention? ──→ profile lock wait (mutex profiling)
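
The tree above is an ordered series of checks, so it maps directly onto a chain of early returns. The following is a minimal sketch: the input field names are hypothetical, and `cacheMissHigh` and `egressFull` are passed as booleans because the tree gives no numeric threshold for them.

```javascript
// Walks the triage tree top to bottom and returns the first branch that matches.
// All field names on `m` are assumptions for illustration.
function triage(m) {
  if (m.cpuPct > 80) {
    return { bottleneck: 'cpu', next: 'profile (pprof, py-spy), optimize algorithm or scale out' };
  }
  if (m.memoryPct > 85) {
    return { bottleneck: 'memory', next: 'look for a leak or oversized buffer, take a heap snapshot' };
  }
  if (m.dbPoolInUse >= m.dbPoolSize) {
    return { bottleneck: 'db', next: 'check pool size, slow queries (EXPLAIN), N+1, indexes' };
  }
  if (m.cacheMissHigh) {
    return { bottleneck: 'cache', next: 'review TTL, key strategy, warm-up' };
  }
  if (m.egressFull) {
    return { bottleneck: 'network', next: 'check NIC limit, payload size, compression' };
  }
  return { bottleneck: 'lock', next: 'profile lock wait (mutex profiling)' };
}

const result = triage({
  cpuPct: 60, memoryPct: 50, dbPoolInUse: 100, dbPoolSize: 100,
  cacheMissHigh: false, egressFull: false,
});
console.log(result.bottleneck); // db
```

The order matters: a hot CPU can mask a slow query, so re-run the triage after each fix instead of stopping at the first finding.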

5) Headroom + autoscaling

Headroom calculation

peak_observed = highest p95 RPS over the last 7 days
target_capacity = peak_observed × (1 + headroom_pct)
provisioned = ceil(target_capacity / per_replica_rps × safety_factor)

Typical values:

headroom_pct = 0.30-0.50 (30-50% extra room).
safety_factor = 1.2 (tolerance for replica failure).
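
As a worked example of the headroom formulas above, the sketch below computes the target capacity and replica count. The function name is hypothetical, and the inputs are illustrative: a peak of 850 RPS with 40% headroom, and an assumed per-replica capacity of 100 RPS that you would replace with the value measured in your own stress test.

```javascript
// Applies the two formulas above; the replica count is rounded up
// because a fractional replica cannot be provisioned.
function provisionedReplicas({ peakObservedRps, headroomPct, perReplicaRps, safetyFactor }) {
  const targetCapacity = peakObservedRps * (1 + headroomPct);
  const replicas = Math.ceil((targetCapacity / perReplicaRps) * safetyFactor);
  return { targetCapacity, replicas };
}

const plan = provisionedReplicas({
  peakObservedRps: 850, // observed peak
  headroomPct: 0.4,     // 40% headroom
  perReplicaRps: 100,   // assumption: measured per-replica capacity
  safetyFactor: 1.2,    // tolerance for replica failure
});
console.log(plan.targetCapacity.toFixed(0), plan.replicas); // 1190 15
```

Here 850 × 1.4 gives 1190 RPS; dividing by 100 RPS per replica gives 11.9, and multiplying by the 1.2 safety factor gives 14.28, which rounds up to 15 replicas.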

HPA tuning

spec:
  minReplicas: 3                    # P&E (1 serving, 1 failure tolerance)
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60     # 60% target → leaves headroom
    # custom metric:
    - type: Pods
      pods:
        metric: { name: rps_per_pod }
        target: { type: AverageValue, averageValue: '100' }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # fast scale-up
      policies:
        - { type: Percent, value: 100, periodSeconds: 15 }
    scaleDown:
      stabilizationWindowSeconds: 300  # slow scale-down (prevents flapping)
      policies:
        - { type: Percent, value: 10, periodSeconds: 60 }
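
To reason about how a target such as `averageUtilization: 60` translates into replica counts, it helps to apply the scaling rule from the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified model of that rule only. It ignores the `behavior` policies, stabilization windows, and pod readiness, and the function name is hypothetical.

```javascript
// Simplified model of the HPA replica calculation for a single metric.
function desiredReplicas({ currentReplicas, currentValue, targetValue, minReplicas, maxReplicas, tolerance = 0.1 }) {
  const ratio = currentValue / targetValue;
  // Within tolerance of the target: no scaling action.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  // Clamp to the configured bounds.
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// 5 replicas at 90% average CPU with a 60% target.
console.log(desiredReplicas({
  currentReplicas: 5, currentValue: 90, targetValue: 60, minReplicas: 3, maxReplicas: 50,
})); // 8
```

When several metrics are configured, as in the spec above, the HPA evaluates each one and uses the largest resulting replica count.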

6) Cost projection

monthly_cost = replicas × instance_$$/hour × 730 hours
              + database_class_$$
              + load_balancer_$$
              + egress_$$ / month
              + storage_$$ / month

Δ = projected - current

Right-sizing: idle replicas → lower the baseline.
Spot/preemptible for non-critical workloads.
Reserved instances for baseline load.
Egress: cross-region/cross-cloud traffic is expensive; co-locate.
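
The monthly cost formula above is straightforward to script so that current and projected plans can be compared. This is a minimal sketch: every price below is a made-up placeholder, not real cloud pricing, and the function and field names are hypothetical.

```javascript
const HOURS_PER_MONTH = 730;

// Mirrors the monthly_cost formula above; the fixed items are monthly amounts.
function monthlyCost({ replicas, instancePerHour, database, loadBalancer, egress, storage }) {
  return replicas * instancePerHour * HOURS_PER_MONTH + database + loadBalancer + egress + storage;
}

// Placeholder prices, for illustration only.
const fixed = { instancePerHour: 0.25, database: 200, loadBalancer: 20, egress: 50, storage: 30 };
const current = monthlyCost({ replicas: 5, ...fixed });
const projected = monthlyCost({ replicas: 8, ...fixed });
const delta = projected - current;
console.log(current, projected, delta); // 1212.5 1760 547.5
```

Use the average replica count for the compute term, not the autoscaling maximum, otherwise the projection overstates what autoscaling actually costs.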

7) Seasonal / campaign forecast

Trend: average growth % over 3 months (weekly).
Peak: Black Friday / year end / campaign; historical data × growth.
Lead time: provisioning time (a managed DB upgrade can take hours to days).
Drill: validate with a load test before the campaign.
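
One way to combine the trend and peak items above is to compound the weekly growth rate up to the event date and then apply a campaign multiplier taken from historical data. The sketch below is one possible model, not the only one: compounding weekly growth is an assumption, and `campaignMultiplier` is a hypothetical parameter, for example last year's campaign peak divided by the normal peak at that time.

```javascript
// Projects peak RPS: organic weekly growth compounded over `weeks`,
// then scaled by a campaign multiplier from historical data.
function forecastPeakRps({ currentPeakRps, weeklyGrowthPct, weeks, campaignMultiplier = 1 }) {
  const organic = currentPeakRps * Math.pow(1 + weeklyGrowthPct / 100, weeks);
  return organic * campaignMultiplier;
}

// Illustrative inputs: 850 RPS today, 1% weekly growth, campaign in 12 weeks.
const organicOnly = forecastPeakRps({ currentPeakRps: 850, weeklyGrowthPct: 1, weeks: 12 });
const withCampaign = forecastPeakRps({ currentPeakRps: 850, weeklyGrowthPct: 1, weeks: 12, campaignMultiplier: 1.7 });
console.log(organicOnly.toFixed(0), withCampaign.toFixed(0));
```

Feed the forecast into the headroom calculation, and start provisioning early enough to cover the lead time of the slowest component.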

8) Publish

Capacity report: docs/capacity/<service>.md
Quarterly review: trend + drill results + autoscaling tuning.
Living doc: update after incidents.

Checklist

Antipatterns

Load testing only the happy path: test error paths too (timeout, retry, circuit breaker).
Load testing in production: use an isolated environment.
Treating VU = user: VU ≠ user; a VU count is the number of concurrent actions.
No thresholds: pass/fail criteria are undefined.
Single profile (load only): stress, spike, and soak are missing.
Saturation = scale up: skips the root cause.
HPA target of 80%: flapping plus pod start lag; aim for 50-60%.
stabilizationWindowSeconds=0 on scale-down: flapping.
No cost projection: added capacity becomes a budget surprise.
0% headroom: full utilization; a single pod failure = outage.
No reserved-instance baseline: 100% on-demand is expensive.
Test results not recorded: no trend, regressions go unseen.
"It's enough for now": no growth % calculation.

Example Agent Behavior

User: /capacity-plan api-svc
Agent (load-test-engineer):
1. Baseline, 7 days:
   - RPS peak 850 (16:00 UTC), avg 320
   - p99 latency 380ms peak, 180ms avg
   - CPU avg 42%, peak 71% (close to saturation)
   - DB pool 60/100 peak
   - Cache hit 91%
2. Saturation: at the 71% CPU peak, p99 jumps to 480ms → saturation point.
3. k6 stress test plan:
   - 100 → 1500 VU over 30 min
   - threshold: p99<500ms, error<0.001
4. Result:
   - At 1200 VU, CPU 92% saturated, p99 1.2s
   - DB pool 95/100, slow query `SELECT * FROM orders WHERE customer_id...`
     (no index, seq scan)
5. Bottleneck: DB index first, then CPU. Add the index → holds 1800 VU.
6. Headroom 40%: peak 850 RPS × 1.4 = 1190 RPS target.
7. HPA: min 5 (warm before peak), max 30, CPU target 55%.
8. Cost: 5→8 replicas avg, peak 24. Δ +$420/month.
9. Seasonal: Q4 target +70% → max 30→50 replicas + add a DB read replica.
10. Drill: re-run the load test on 2026-11-01, ahead of Q4.

Output Format

# Capacity Plan: <service>

## Baseline (7 days)
| Metric | Peak | Avg | p95 | p99 |

## Saturation Point
- CPU/memory/DB/cache/network

## Test Results
- smoke / load / stress / spike / soak: pass/fail + max sustained

## Bottleneck
- Triage (root cause, not the resource)

## Capacity Plan
- Headroom %, safety factor, target replicas

## HPA Config
```yaml
# tuned spec
```

## Cost Projection
- current: $X/month
- projected: $Y/month (Δ +$Z)

## Seasonal Forecast
- Q4 / campaign target

## Action Items
| Priority | Action | Owner | Due | Issue |