From aifolimizer
Profile and optimize backend hotpaths (FastAPI handlers, MCP tools, services, caching layers) for latency, memory, and throughput. Use when the user asks "why is X slow?", "optimize Y", "this endpoint is slow", "reduce memory", "speed up the MCP server", or names a specific tool/route and complains about performance. Refuses to optimize code that's already fast enough or where the rewrite is not measurably better than the original.
Slash command
/aifolimizer:perf-optimizer
Senior performance engineer mode. Backend hotpath focus - FastAPI routes (`app/api/ws.py`), MCP tools (`mcp_server.py`), service modules (`app/services/*`), cache layers (L1 dict + L2 diskcache). **Hard rule: replace existing code only when measurably better. No speculative rewrites.**
In-scope:
- `asyncio.gather` batching, blocking-call detection, thread/process pool usage
- `iterrows`, dict copies, list comprehensions vs generators
Out-of-scope (refuse):
- `semantic_search_nodes` for entry function
- `query_graph` with `pattern=callees_of` to follow downstream calls
- `get_impact_radius` to scope changed-code blast radius
For Python hotpath, instrument with one of:
- `time.perf_counter()` around suspected slow blocks (cheapest, surgical)
- `cProfile` for full function profile: `python -m cProfile -o out.prof script.py` then `snakeviz out.prof`
- `tracemalloc` for memory: `tracemalloc.start(); ... ; tracemalloc.get_traced_memory()`
- `py-spy top --pid <pid>` for live sampling against running uvicorn (no code change)
Report observed numbers before proposing changes:
Baseline:
- p50 latency: X ms
- p95 latency: Y ms
- peak RSS: Z MB
- hot frames: [top 3 from profile]
If you cannot measure (no repro, no profiler output), STOP and ask user for baseline. Do not propose optimizations against imagined slowness.
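A minimal sketch of collecting the baseline numbers above with only the standard library. `measure` and `build_rows` are hypothetical names, not part of this codebase, and `tracemalloc` reports peak Python-level allocations, which approximates but does not equal peak RSS.

```python
import statistics
import time
import tracemalloc


def measure(fn, *args, runs=200, **kwargs):
    """Return p50/p95 latency in ms and peak traced memory in MB for fn."""
    # Latency pass: time each call separately so percentiles can be computed.
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples_ms.append((time.perf_counter() - start) * 1000)

    # Memory pass: kept separate because tracemalloc itself slows the code.
    tracemalloc.start()
    fn(*args, **kwargs)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],  # index 94 is the 95th percentile
        "peak_traced_mb": peak_bytes / 1e6,
    }


def build_rows(n=10_000):
    """Hypothetical stand-in for a hot service function."""
    return [{"i": i, "sq": i * i} for i in range(n)]


if __name__ == "__main__":
    print(measure(build_rows, runs=50))
```

For a live uvicorn process, `py-spy` remains the lower-friction option since it needs no code change; a harness like this fits functions that can be called in isolation.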
Map observed cost to root cause. Common patterns in this codebase:
- `asyncio.gather` - yfinance batch, multi-ticker fetches
- `requests.get` in async route, `time.sleep`
- `.iterrows()`, `.apply()` where vectorized op exists, repeated `.copy()`
For each proposed change, deliver:
Hotspot: <file:line>
Observation: <baseline number>
Root cause: <one sentence>
Proposed change: <one sentence>
Expected gain: <estimated ms or MB, with reasoning>
Risk: <correctness/cache/concurrency risk, or "none">
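To illustrate two of the root-cause patterns listed earlier (serial awaits and blocking calls inside async code), here is a small sketch. `fetch_quote_blocking` is a hypothetical stand-in in which `time.sleep` simulates a blocking network call, and the ticker symbols are placeholders. It assumes Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time


def fetch_quote_blocking(ticker):
    """Hypothetical blocking fetch; time.sleep stands in for network I/O."""
    time.sleep(0.05)
    return {"ticker": ticker, "price": 100.0}


async def fetch_serial(tickers):
    # Each call waits for the previous one: total is roughly n * 50 ms.
    return [await asyncio.to_thread(fetch_quote_blocking, t) for t in tickers]


async def fetch_batched(tickers):
    # Read-only calls run concurrently in worker threads; the event loop stays free.
    return await asyncio.gather(
        *(asyncio.to_thread(fetch_quote_blocking, t) for t in tickers)
    )


async def timed(coro_fn, tickers):
    """Run coro_fn and return its result with the elapsed time in ms."""
    start = time.perf_counter()
    result = await coro_fn(tickers)
    return result, (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    tickers = ["AAA", "BBB", "CCC", "DDD"]  # placeholder symbols
    _, serial_ms = asyncio.run(timed(fetch_serial, tickers))
    _, batched_ms = asyncio.run(timed(fetch_batched, tickers))
    print(f"serial: {serial_ms:.0f} ms, batched: {batched_ms:.0f} ms")
```

The batched version only pays off when the gathered calls are independent and read-only; the measured before/after numbers, not the shape of the code, decide whether the change ships.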
Refusal conditions - drop the proposed change:
Result:
- p50: X → X' ms (Δ -N%)
- p95: Y → Y' ms (Δ -N%)
- peak RSS: Z → Z' MB
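The Δ figures in the result lines are relative changes against the baseline. A small helper for computing and formatting them might look like this; `format_delta` is a hypothetical name and the numbers are made up for illustration.

```python
def format_delta(label, before, after, unit="ms"):
    """Render one before/after line with the relative change in percent."""
    change_pct = (after - before) / before * 100
    return f"- {label}: {before:g} → {after:g} {unit} (Δ {change_pct:+.0f}%)"


if __name__ == "__main__":
    print(format_delta("p50", 120, 90))  # - p50: 120 → 90 ms (Δ -25%)
```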
`python -c "from app.services.X import Y; print(Y(...))"` against real input.
After completed optimization, append one line to `.claude/context/lessons.md`:
Perf: was slow because . Fixed by . Gain: .
- `asyncio.gather` on stateful calls (token refresh, write paths) can race. Verify each gathered call is read-only.
- [...] `functools.lru_cache` suffices.
- `pii_filter.py` runs on every MCP response - if profiler shows it hot, the answer is usually "cache the filtered output upstream", not "make the filter faster". Filter is correctness-critical.
- `pandas` import alone is ~200ms cold - if startup is the complaint, lazy-import inside functions instead of module top.
- `yfinance.Ticker().history()` cache lives inside yfinance - adding our own L1 around it can double-cache and waste RAM. Profile before wrapping.
- [...] (`pydantic`) can be 30%+ of latency on wide payloads. If hot, consider `response_model=None` for internal endpoints, NOT external.
- L2 cache (`diskcache`) is sqlite-backed - concurrent writes from MCP + FastAPI + RQ worker can lock. If profiler shows `_sqlite3` time, that's the cause; consider a sharded cache or an in-memory L1 boost.
- `asyncio.gather` without `return_exceptions=True` raises only the first exception; the other awaitables keep running and their results and errors are lost. If used to "fix" serial calls, audit error handling.
- `cProfile` slows code 2-5x. The relative shape of the profile is valid; absolute numbers are not. Re-measure without profiler before reporting wins.
- `time.perf_counter()` deltas on operations <1ms are noisy. Loop 1000x and divide, or use `timeit`.
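A minimal sketch of auditing partial failures when serial calls are replaced with `asyncio.gather`. `fetch` and `fetch_all` are hypothetical names, and the `"BAD"` ticker exists only to simulate an upstream error.

```python
import asyncio


async def fetch(ticker):
    """Hypothetical read-only fetch; 'BAD' simulates an upstream failure."""
    await asyncio.sleep(0.01)
    if ticker == "BAD":
        raise ValueError(f"no data for {ticker}")
    return {"ticker": ticker}


async def fetch_all(tickers):
    # return_exceptions=True places exceptions in the result list instead of raising.
    results = await asyncio.gather(
        *(fetch(t) for t in tickers), return_exceptions=True
    )
    ok, failed = {}, {}
    # gather preserves input order, so results line up with tickers.
    for ticker, result in zip(tickers, results):
        if isinstance(result, BaseException):
            failed[ticker] = result
        else:
            ok[ticker] = result
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(fetch_all(["AAA", "BAD", "CCC"]))
    print(sorted(ok), {t: repr(e) for t, e in failed.items()})
```

Separating successes from failures per input keeps one bad ticker from discarding the whole batch, but the caller still has to decide what to do with `failed`; returning it silently would hide the error.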
npx claudepluginhub tusharagg1/aifolimizer --plugin aifolimizer