Capture And Recovery Eval

CARE benchmark

Test data

Capture And Recovery Eval. Can a fresh agent recover a plan's full intent — the why, not just the what — from the plan alone? Click for the full story

Top By Intent: Loading--
Candidates: ----
Families: ----
Runs: ----

Candidate summaries

Intent Map

Axes

X Y Rank

Display

Current leaders

Scoreboard

Total intent recovery

Candidates

0 selected

Reference scores

Public Benchmark Library

Known source data

Model / Benchmark Matrix

Contextual resource data

Browse all candidates

Capture And Recovery Eval

The CARE benchmark

Every coding agent starts by turning a spec into a plan — and a plan is a lossy handoff. It captures what to build but quietly sheds the why: the trade-offs, constraints, and reasoning that justified each decision. CARE measures how much of that intent survives the planning step — and whether a fresh agent can recover it from the plan alone.

In real agentic pipelines, plans get passed between sessions, agents, and people. When the reasoning behind a plan evaporates, downstream work drifts — rebuilding the letter of the spec while losing its intent, often with no one noticing. CARE is a stress test for that durability: does the why travel with the plan, or leak out the moment its author's context is gone?

01

Capture

A candidate model reads a product spec and writes an implementation plan.
02

Recovery

A fresh agent — shown only that plan, never the spec — reconstructs the original intent and rationale from it alone.
03

Eval

A scorer measures the gap between what the plan stated on its surface and what intent was actually recoverable.

What the score means

Each spec ships with a set of gold whys — the rationale behind every requirement — weighted by importance and split into system-level intent (the decisions that shape the architecture) and feature-level intent. A run's score is the share of that weighted intent a blind reconstructor recovers from the plan. The map plots two axes: planning quality, how sound the plan is on its own terms, against total intent recovery, how much of the original why actually made it through.

Plans that carry their reasoning forward score well. Plans that look complete but quietly lose the why don't.

Watch · the story behind it

How I Actually Used AI Agents to Build a Benchmark Matt Maher · YouTube

CARE is open source — clone it, point an agent at it, and run your own waves.

View the benchmark on GitHub