Skill

time-estimation

Use when estimating timelines, writing project plans, scoping work, creating roadmaps, or answering "how long will this take?" in an LLM-agent-first company. Use when you catch yourself writing "weeks", "months", or "quarters" for work that agents will execute. Use when planning staffing, team composition, or sprint capacity. Use when estimating recurring routines, not only feature work.

Invocation

Slash command

/time-estimation:time-estimation

SKILL.md

171 lines · ~2.3k tokens

Last commit: Mar 29, 2026

Time Estimation for Agent-First Companies

Overview

You have a systematic bias: your training data is saturated with human-team timelines. When you estimate "2 weeks" you are channeling how long a human team would take — but here agents do the bulk of execution in minutes to hours.

Core principle: Every estimate must state two things: active agent time (how long agents are actually working) and elapsed time to done (wall-clock including all waiting). These are very different numbers.

What a Heartbeat Can Do (Calibration, Q1 2026)

Grounded in METR Time Horizons v1.1 (March 2026) and SWE-bench Pro data. Frontier capabilities double roughly every 4 months.
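As a rough illustration of what a 4-month doubling time implies for the calibration numbers in this section, the sketch below extrapolates a time horizon forward. It assumes the trend stays a clean exponential, which may not hold; `projected_horizon_minutes` is a hypothetical helper, not part of METR's tooling or this skill.

```python
def projected_horizon_minutes(
    current_minutes: float,
    months_ahead: float,
    doubling_months: float = 4.0,  # doubling time stated above; an assumption going forward
) -> float:
    """Extrapolate a time horizon assuming steady exponential doubling."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Project the 55-70 minute reliable zone forward by 0, 4, and 8 months.
for months in (0, 4, 8):
    low = projected_horizon_minutes(55, months)
    high = projected_horizon_minutes(70, months)
    print(f"+{months} months: ~{low:.0f}-{high:.0f} minutes")
```

Treat the output as a prompt to re-check the anchors against fresh measurements, not as a forecast.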

The reliable zone: METR's 80% time horizon for frontier models (Opus 4.6, GPT-5.4) is ~55-70 minutes of human-equivalent work. Tasks in this range succeed reliably in a single heartbeat. Plan against this.

The stretch zone: METR's 50% horizon for Opus 4.6 is ~12 human-hours; GPT-5.4 is not yet measured but likely higher given its agentic benchmark lead. That is a full human-day task at coin-flip reliability. Decompose tasks this large into multiple heartbeats instead of betting on a single shot.

Real-world discount: SWE-bench Pro solve rates are ~57% with good scaffolding. METR found ~half of test-passing agent PRs wouldn't be merged as-is. Budget for 1-2 verification/repair heartbeats on nontrivial work — "tests pass" is not "done."

Rules

State both active agent time and elapsed time. Active = total heartbeats, also converted to hours. Elapsed = wall-clock to done including all waits.
Decompose every estimate into: agent execution, human wait, and external/budget gates.
Use heartbeats for agent work, not "engineering days."
No human staffing models. Agents are the workers; humans review and decide.
Sequence by dependency, not by calendar week.
Name the critical path. It is usually human wait.
Budget for rework. Assume 2-3 agent iterations on nontrivial tasks. Each may also incur human review latency.
Widen the range when uncertain. R&D, legacy code, ambiguous requirements — say so.
Don't assume clean parallelism. Multiple agents working in parallel still hit integration points, shared review queues, and merge conflicts. Treat parallelism as at most a 2-3x speedup on elapsed time, not an Nx speedup for N agents. (See Parallelism section.)

Calibration Anchors

Agent execution

One heartbeat reliably handles ~55-70 min of human-equivalent work on well-scoped tasks (frontier model, strong scaffolding) — this is the METR 80% horizon. Include verification/repair loops in your count.

| Task shape | Heartbeats (incl. verify/repair) | Active agent time |
| --- | --- | --- |
| Small bounded change (bug fix, config, simple test) | 1-2 | ~1-2 hours |
| CRUD API with DB migrations and tests | 2-4 | ~2-4 hours |
| Third-party service integration | 2-3 | ~2-3 hours |
| Comprehensive test suite for existing code | 2-4 | ~2-4 hours |
| Bulk refactor across many files | 2-5 | ~2-5 hours |
| New microservice from scratch | 4-8 | ~4-8 hours |
| Full application (frontend + backend + tests) | 8-20 | ~8-20 hours (~1-2.5 days at 8 hours/day) |
| Service extraction from monolith | 5-12 per service | ~5-12 hours per service |

For poorly-understood codebases or ambiguous requirements, multiply by 2-3x.
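The anchors and the 2-3x multiplier can be applied mechanically. The sketch below is one way to do it, assuming the multiplier widens the range (low end times 2, high end times 3); the dictionary keys and the `heartbeat_range` helper are hypothetical names chosen for illustration.

```python
# Heartbeat ranges copied from the table above (incl. verify/repair).
ANCHORS: dict[str, tuple[int, int]] = {
    "small_change": (1, 2),
    "crud_api": (2, 4),
    "third_party_integration": (2, 3),
    "test_suite": (2, 4),
    "bulk_refactor": (2, 5),
    "new_microservice": (4, 8),
    "full_application": (8, 20),
    "service_extraction": (5, 12),  # per service
}


def heartbeat_range(task_shape: str, poorly_understood: bool = False) -> tuple[int, int]:
    """Return (low, high) heartbeats, widened 2-3x for murky codebases or requirements."""
    low, high = ANCHORS[task_shape]
    if poorly_understood:
        # Assumption: apply 2x to the low end and 3x to the high end.
        return low * 2, high * 3
    return low, high


print(heartbeat_range("crud_api"))
print(heartbeat_range("crud_api", poorly_understood=True))
```

Widening both ends keeps the uncertainty visible in the estimate instead of hiding it in a single padded number.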

Parallelism

Multiple agents can work in parallel, but don't assume perfect scaling. Integration points, review queues, merge conflicts, and shared state create contention.

Independent tasks (separate repos, no shared state): 2-3x elapsed speedup is realistic
Tasks in the same repo/codebase: expect merge conflicts, context invalidation; 1.5-2x speedup is more honest
Tasks with shared dependencies: often must serialize at integration points

State your parallelism assumption explicitly. "N heartbeats across M agents" should include what blocks parallel execution.
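A minimal sketch of the speedup ranges above, assuming roughly one hour per heartbeat (the ratio the Agent execution table implies) and that speedup can never exceed the number of agents. `SPEEDUP` and `elapsed_agent_hours` are illustrative names; tasks with shared dependencies are left out because they may simply serialize.

```python
# (worst-case, best-case) elapsed speedup by contention type, from the list above.
SPEEDUP: dict[str, tuple[float, float]] = {
    "independent": (2.0, 3.0),
    "same_repo": (1.5, 2.0),
}


def elapsed_agent_hours(
    heartbeats: int,
    agents: int,
    contention: str,
    hours_per_heartbeat: float = 1.0,  # assumption: ~1 hour per heartbeat
) -> tuple[float, float]:
    """Return (fastest, slowest) elapsed agent hours for a parallel plan."""
    serial_hours = heartbeats * hours_per_heartbeat
    if agents <= 1:
        return serial_hours, serial_hours
    worst, best = SPEEDUP[contention]
    # Speedup is capped by the number of agents actually running.
    fastest = serial_hours / min(best, agents)
    slowest = serial_hours / min(worst, agents)
    return fastest, slowest


fastest, slowest = elapsed_agent_hours(8, agents=3, contention="same_repo")
print(f"~{fastest:.1f}-{slowest:.1f} hours elapsed agent time")
```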

Human wait

| Step | Typical wait |
| --- | --- |
| Human code review / sign-off | 2 hours - 1 day |
| Human answers a clarifying question | 1 hour - 1 day |
| Human makes a strategic/architecture decision | 1-3 days |
| External vendor/API access provisioning | 1-5 days |
| Legal/compliance review | 2-5 days |
| Staged production rollout observation | 3-7 days (parallelizable across services) |
| SOC 2 audit (the audit itself, not writing docs) | 4-8 weeks |
| Enterprise customer procurement/contract | 2-8 weeks |

Assume humans batch reviews, not that they respond instantly.
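To total the human wait on a critical path, the ranges in the table can be summed step by step. This sketch assumes the steps happen strictly one after another and that "1 day" means 24 wall-clock hours; the keys and `total_wait` are hypothetical names.

```python
from datetime import timedelta

# (low, high) typical waits, copied from the table above.
HUMAN_WAIT: dict[str, tuple[timedelta, timedelta]] = {
    "code_review": (timedelta(hours=2), timedelta(days=1)),
    "clarifying_question": (timedelta(hours=1), timedelta(days=1)),
    "architecture_decision": (timedelta(days=1), timedelta(days=3)),
    "vendor_access": (timedelta(days=1), timedelta(days=5)),
    "legal_review": (timedelta(days=2), timedelta(days=5)),
    "staged_rollout": (timedelta(days=3), timedelta(days=7)),
    "soc2_audit": (timedelta(weeks=4), timedelta(weeks=8)),
    "enterprise_procurement": (timedelta(weeks=2), timedelta(weeks=8)),
}


def total_wait(steps: list[str]) -> tuple[timedelta, timedelta]:
    """Sum (low, high) waits for steps that happen strictly in sequence."""
    low = sum((HUMAN_WAIT[s][0] for s in steps), timedelta())
    high = sum((HUMAN_WAIT[s][1] for s in steps), timedelta())
    return low, high


low, high = total_wait(["clarifying_question", "code_review", "code_review"])
print(f"Human wait: {low} to {high}")
```

Steps that can overlap (for example, rollout observation across several services) should be counted once rather than summed.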

Recurring Routines

Not all estimation is for feature work. For recurring tasks (daily reports, weekly reviews, periodic migrations, scheduled maintenance):

Estimate per occurrence — active agent time + elapsed time for a single run
Justify the frequency separately — why daily vs weekly vs on-demand?
Include amortized setup cost — if the routine needs initial setup (templates, pipelines, permissions), estimate that separately from the per-run cost

Example: "Weekly dependency audit: 1 heartbeat (~1 hour active) per run, ~2 hours elapsed including review. Setup: 2-3 heartbeats one-time."
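The weekly dependency audit example can be totaled over a planning period as a quick check on whether the setup cost matters. The 52-run year and the `routine_heartbeats` helper are assumptions made for illustration.

```python
def routine_heartbeats(
    per_run: int,
    runs: int,
    setup_low: int,
    setup_high: int,
) -> tuple[int, int]:
    """Return (low, high) total heartbeats: one-time setup plus all runs."""
    recurring = per_run * runs
    return recurring + setup_low, recurring + setup_high


# Weekly audit, 1 heartbeat per run, 2-3 heartbeats of one-time setup.
# Assumption: 52 runs in a year.
low, high = routine_heartbeats(per_run=1, runs=52, setup_low=2, setup_high=3)
print(f"Year total: {low}-{high} heartbeats")
print(f"Amortized per run: {low / 52:.2f}-{high / 52:.2f} heartbeats")
```

Here the setup cost is small next to a year of runs; for a routine that runs only a handful of times, the setup term dominates and should be reported on its own line.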

Instinct Check

When you catch yourself writing a human-scale number, check whether you've decomposed it:

| If your instinct says | Ask: where is the time actually going? |
| --- | --- |
| 1-2 weeks | Agent finishes in hours. Review cycles → likely 1-3 days |
| 1 month | A few rounds of human feedback → likely 3-7 days. External gates → could be 2-4 weeks |
| 1 quarter | Agents parallelize (with contention). Humans serialize. → likely 2-4 weeks. Heavy external deps → could be a quarter |
| 6+ months | Only if regulatory gates, enterprise sales, or multi-month audits are on the critical path |

These are guidelines. The actual answer depends on the decomposition.

When Estimates Are Legitimately Long

Not everything compresses. Be honest:

Regulatory/compliance (SOC 2 audits, PCI, FedRAMP) — irreducible external timelines
Enterprise sales cycles — procurement dominates "first customer live" roadmaps
R&D / experimentation — the approach may not work; pivots expected
Legacy systems — undocumented code, many rework cycles
External integrations — third-party access provisioning takes weeks
Staged rollouts — real traffic observation over days

State why it's long (which gates), not just a big number.

Output Format

Every estimate must include:

Active agent time — total heartbeats and equivalent hours/days
Elapsed time to done — wall-clock calendar time including all waits
Agent execution breakdown — heartbeats per subtask
Human wait breakdown — each human-gated step with duration
Critical path — name it
Assumptions — human response time, parallelism, budget, rework rounds
What could make it longer

Example:

Active agent time: ~8 heartbeats (~8 hours)

Elapsed time: ~2 days

Agent: 8 heartbeats, 3 parallel agents (~2x speedup given same-repo contention), ~4 hours elapsed compute

Human wait: 2 review rounds, ~1 day each assuming batch review

Critical path: human review cycles

Assumes: reviews within same business day, requirements are clear
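One way to keep every estimate in this shape is to build it from a small record type. The sketch below is illustrative only: the `Estimate` class, its field names, and the subtask names in the usage example are hypothetical, and it assumes about one hour per heartbeat.

```python
from dataclasses import dataclass, field


@dataclass
class Estimate:
    agent_breakdown: dict[str, int]  # heartbeats per subtask
    human_waits: dict[str, str]      # human-gated step -> duration
    elapsed: str                     # wall-clock time to done
    critical_path: str
    assumptions: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)  # what could make it longer
    hours_per_heartbeat: float = 1.0  # assumption: ~1 hour per heartbeat

    @property
    def active_heartbeats(self) -> int:
        """Total heartbeats, derived from the per-subtask breakdown."""
        return sum(self.agent_breakdown.values())

    def render(self) -> str:
        hours = self.active_heartbeats * self.hours_per_heartbeat
        lines = [
            f"Active agent time: ~{self.active_heartbeats} heartbeats (~{hours:g} hours)",
            f"Elapsed time: {self.elapsed}",
            "Agent: " + ", ".join(f"{k} {v}" for k, v in self.agent_breakdown.items()),
            "Human wait: " + "; ".join(f"{k}: {v}" for k, v in self.human_waits.items()),
            f"Critical path: {self.critical_path}",
            "Assumes: " + "; ".join(self.assumptions),
            "Could make it longer: " + "; ".join(self.risks),
        ]
        return "\n".join(lines)


estimate = Estimate(
    agent_breakdown={"schema migration": 2, "API endpoints": 4, "tests": 2},
    human_waits={"review round 1": "~1 day", "review round 2": "~1 day"},
    elapsed="~2 days",
    critical_path="human review cycles",
    assumptions=["batch review", "requirements are clear"],
    risks=["a third review round"],
)
print(estimate.render())
```

Deriving the active total from the breakdown means the headline number cannot drift away from the per-subtask heartbeats.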

Retrospective: Calibrate From Misses

After completing work, compare estimates to actuals. Track:

| Field | Purpose |
| --- | --- |
| Estimated active agent time | What you predicted |
| Actual active agent time | What it took |
| Estimated elapsed time | Wall-clock prediction |
| Actual elapsed time | Wall-clock reality |
| Miss cause | What drove the gap (rework, human delay, scope change, discovery, integration issues) |

Common miss patterns:

Active overestimate → tasks were simpler than expected; tighten anchors
Active underestimate → hidden complexity, poor codebase, ambiguous requirements; widen uncertainty multiplier
Elapsed overestimate → humans responded faster than assumed; adjust wait assumptions
Elapsed underestimate → blocked on human decisions, external gates, or integration contention you didn't model

Use misses to update your per-task heartbeat anchors over time. The calibration table in this skill is a starting point, not ground truth for your specific codebase and team.
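A sketch of how the tracked fields could feed back into the anchors. The 25% tolerance band, the `Retro` record, and the simple mean-ratio correction are all assumptions made for illustration; a real team might prefer a median or a per-task-shape correction.

```python
from dataclasses import dataclass


@dataclass
class Retro:
    task: str
    estimated_active_hours: float
    actual_active_hours: float
    estimated_elapsed_hours: float
    actual_elapsed_hours: float
    miss_cause: str = ""


def classify(estimated: float, actual: float, tolerance: float = 0.25) -> str:
    """Label a miss; the tolerance band is an arbitrary illustrative choice."""
    ratio = actual / estimated
    if ratio > 1 + tolerance:
        return "underestimate"
    if ratio < 1 - tolerance:
        return "overestimate"
    return "on target"


def anchor_correction(records: list[Retro]) -> float:
    """Mean actual/estimated ratio for active time; scale anchors by this factor."""
    ratios = [r.actual_active_hours / r.estimated_active_hours for r in records]
    return sum(ratios) / len(ratios)


history = [
    Retro("CRUD API", 4, 6, 24, 48, miss_cause="rework"),
    Retro("Bug fix", 2, 1, 8, 6),
]
for r in history:
    active = classify(r.estimated_active_hours, r.actual_active_hours)
    elapsed = classify(r.estimated_elapsed_hours, r.actual_elapsed_hours)
    print(f"{r.task}: active {active}, elapsed {elapsed}")
print(f"Anchor correction factor: {anchor_correction(history):.2f}")
```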

Red Flags — Check Your Work

If you find yourself writing any of these, pause and decompose:

"A team of N engineers": you're staffing a human team
"N engineering days": you're using human-pace units
"Week 1 / Week 2 / Week 3": you're planning in sprints, not dependencies
"Agents compress timelines by 30-50%": agents replace execution outright; they do not speed up a human team by a fraction
Any estimate missing active agent time or elapsed time
Any parallelism claim without stating what blocks it
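Several of these red flags are mechanical enough to lint for. The sketch below is a rough heuristic, not a complete checker: the regular expressions and the `find_red_flags` name are assumptions, and it does not try to detect parallelism claims that omit their blockers.

```python
import re

# Heuristic patterns for the phrasing red flags listed above.
PATTERNS: dict[str, re.Pattern[str]] = {
    "human staffing": re.compile(r"\bteam of \w+ engineers?\b", re.I),
    "human-pace units": re.compile(r"\bengineering[- ]days?\b", re.I),
    "calendar-week sequencing": re.compile(r"\bweek\s+\d+\b", re.I),
    "fractional compression": re.compile(r"compress\w*\s+timelines?\s+by\s+\d+", re.I),
}

REQUIRED_PHRASES = ("active agent time", "elapsed time")


def find_red_flags(text: str) -> list[str]:
    """Return the names of red flags found in a draft estimate."""
    flags = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    lowered = text.lower()
    for phrase in REQUIRED_PHRASES:
        if phrase not in lowered:
            flags.append(f"missing {phrase}")
    return flags


draft = "A team of 4 engineers needs 10 engineering days. Week 1: setup."
print(find_red_flags(draft))
```

An empty result means only that none of these phrasings appeared; the estimate still needs the decomposition check by hand.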
