What counts as evidence: grading the output of a tool-using agent

Blog / May 28th, 2026

When a tool-using AI agent finishes a task, the instinct is to read its summary. The summary is the part the agent writes at the end — the paragraph that ties the queries it ran, the files it read, and the citations it pulled into a single conclusion.

The summary is also the part of the output you can least judge. An agent that swept three tools carefully and an agent that ran one half-hearted query produce summaries that read about the same. The actual material is in the conversation history — the queries that returned data, the tests that passed or failed, the primary sources that resolved to a real URL — but a reviewer would have to read that history end-to-end to rank the report. Most reviewers don't. They read the summary, accept it or push back on it, and move on.

This post is about a different shape for the output. Make findings, not paragraphs, the unit. Have the agent append each finding as it works, with the tool call that produced it and a claimed severity. Then run two mechanical checks — one at record time, one at synthesis time — before the report is rendered. The claims that survive both checks are the ones the reviewer reads first. The claims that fail are still there, still visible, with their downgrade reason attached. We use this approach inside our investigation agent for production incidents, and the rest of this post is the pattern rather than the product, because the pattern is the portable part.

Figure: on the left, a single paragraph labelled summary with one phrase highlighted as a strong signal. On the right, a list of structured findings, each with a subject, a source, and a severity grade, sorted with the highest grades at the top.

The same agent run, written two ways. A paragraph has one rhythm; a list has structure: each row carries its source and its grade, and the reviewer's eye starts at the top instead of the middle.

The unit is a finding, not a paragraph

A finding is a small record. It carries an identifier, a description of what the agent saw, the tool call that produced it, the subject it concerns, and a claimed severity. It may also carry an evidence block — a handful of typed fields that the agent fills in when its claim depends on a specific property of the data.

{
  id:          "uuid",
  source:      "the tool that produced this finding",
  subject:     "the thing the finding is about",
  observation: "what the agent saw, in one or two sentences",
  severity:    "low | medium | high | critical",
  evidence:    { /* claim-property fields, optional */ }
}

What matters about that shape is not the field list — the field list will look different in every agent. What matters is that the agent appends one of these records every time it learns something, instead of holding the material in conversation and writing a paragraph at the end. The list grows monotonically. The final report is a function of the list, not the last message.
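A minimal sketch of the append-only list, assuming an in-memory store. The names `createFindingStore`, `append`, and `all` are illustrative and not part of any library, and the sequential identifiers stand in for the UUIDs mentioned above; a real agent would likely persist the list somewhere durable.

```javascript
// Hypothetical in-memory store. It exposes append and read, and nothing that
// edits or deletes, so the list can only grow.
function createFindingStore() {
  const findings = [];
  return {
    // Append one finding and return the stored record with its identifier.
    // Sequential ids are an assumption for this sketch; a UUID works the same way.
    append(input) {
      const record = { ...input, id: `finding-${findings.length + 1}` };
      findings.push(record);
      return record;
    },
    // Return a copy of the list so callers cannot drop or reorder entries.
    all() {
      return [...findings];
    },
  };
}

// The final report is computed from store.all(), not from the agent's last message.
const store = createFindingStore();
store.append({
  source: "test-runner",
  subject: "src/auth/session.js",
  observation: "The session expiry test fails on this branch.",
  severity: "medium",
});
```

The store deliberately has no update or delete method: anything the agent learns later goes in as a new finding rather than as a rewrite of an old one.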

Even before any grading runs, this shift gives a reviewer something to sort. A coding agent doing a PR review can append a finding for every file it changed its mind about, every test that failed, every type error it provoked, every grep that came up empty. A research agent answering a market question can append a finding for every primary source it resolved, every figure it pulled out of a filing, every contradiction between two filings. The reviewer no longer has to read the agent's prose to find out what happened; they read the list. The grading we are about to describe is what makes the list rankable, not just sortable.

Gate one: claim-property evidence at record time

The first gate is local. It applies the moment a finding is recorded, without reference to any other finding. Its job is narrow on purpose: it disciplines the single top tier — the one the agent reaches for when it wants the reviewer to drop what they are doing. Everything else, including the next tier down, is left to the synthesis pass.

Pick a small set of objective properties that have to be true for a top-tier claim. They have to be checkable without an LLM in the loop — a number above a threshold, a boolean set to true, a string with a length greater than zero. In a production-incident agent, those properties might be a user-facing error rate above one percent, a replica count of zero, or a node-level resource-exhaustion signal. In a coding agent reviewing a pull request, they might be a failing test attached to the finding, a type error with a file and line, or a compile error from a specific build target. In a contract-review agent, they might be a quoted clause matching one of a small set of pre-listed red-flag patterns, or a contradiction with a clause earlier in the same document with both clauses' offsets attached.

Then the gate is mechanical: if the agent claims the top tier and none of those fields are filled in, the recording fails. Not "downgrade silently" — fail. The tool returns a short message saying which fields would have satisfied the gate, and the agent has to either attach one of them or claim a lower tier and re-record.

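// hasGatingEvidence, rejection, and store are helpers defined elsewhere.
function recordFinding(input) {
  // Only the top tier is gated at record time; lower tiers pass straight through.
  if (input.severity === "critical" && !hasGatingEvidence(input.evidence)) {
    // Refuse the recording instead of downgrading silently.
    return rejection(
      "severity=critical requires one of: <list the gating fields>. " +
      "Re-call with a lower severity, or attach the evidence above."
    );
  }
  return store.append(input);
}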

Two things to notice about this gate. First, it is not a judgement on whether the agent is right — an agent can attach a failing test that turns out to be a flaky test, and the gate has no way to know. The gate's job is to make the strongest claim cost something. The agent has to do extra work to make it. Second, the field set is small and known. Adding more gating fields is a deliberate change to the rubric, not something the agent decides at runtime. If the gating set grows past a handful of fields, it has stopped being a gate and started being a checklist.
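One possible shape for the `hasGatingEvidence` helper, sketched for the production-incident example. The field names `userFacingErrorRate`, `replicaCount`, and `resourceExhausted` are hypothetical; the one-percent threshold comes from the example above, and every check is a plain comparison.

```javascript
// Hypothetical gating fields for a production-incident agent.
// Each check is a plain comparison; no model is involved.
const GATING_CHECKS = {
  // User-facing error rate as a fraction; the gate opens above one percent.
  userFacingErrorRate: (v) => typeof v === "number" && v > 0.01,
  // A replica count of zero means nothing is serving.
  replicaCount: (v) => v === 0,
  // Node-level resource-exhaustion signal.
  resourceExhausted: (v) => v === true,
};

// True when at least one gating field is present and satisfies its check.
function hasGatingEvidence(evidence) {
  if (evidence === null || typeof evidence !== "object") return false;
  return Object.entries(GATING_CHECKS).some(
    ([field, check]) => field in evidence && check(evidence[field])
  );
}

// Field names for the rejection message the agent receives.
const gatingFieldNames = Object.keys(GATING_CHECKS);
```

Keeping the checks in one small table makes the rubric reviewable: adding a gating field is a visible, one-line change to `GATING_CHECKS`, not something the agent can do at runtime.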

The mechanical bit is deliberate. LLM-as-judge evaluation has become the default for grading model outputs, and it earns that place on matters of degree — was this answer better than that one, does this paragraph follow the rubric, is the tone right. Whether a number crossed a threshold is not a matter of degree. Code can check it exactly, faster, cheaper, and without the variance of a model in the loop. Reaching for a judge here would import non-determinism into a question that does not have any. The gates here ask a sharp question and answer it without an LLM.

Figure: an agent attempting to record a finding at the top severity. The recording tool checks for at least one gating field set on the evidence block. If a field is set, the finding is stored at the claimed severity. If no field is set, the recording is rejected and the agent receives feedback listing the gating fields and the option to claim a lower severity.

The record-time gate. A top-tier claim that does not carry one of the pre-declared gating fields is refused at the point of recording — the agent gets a structured message and either attaches a field or claims a tier down.

Gate two: cross-source corroboration at synthesis time

The second gate is global, and it runs once, after the agent has stopped collecting. It asks the same kind of question for every elevated finding: is there a sibling that talks about the same subject but came from a different source?

Different sources are what matter. Two findings from the same tool, even with two queries and two observations, are not corroboration; they are one source agreeing with itself. The check is: subject the same, source different, identifier different. If at least one such sibling exists, the elevated finding stays elevated. The grade pass records which siblings backed it, so the audit trail tells the reviewer where the corroboration came from. If no sibling exists and the finding carries no justification (the carve-out for that is covered in its own section below), the finding is downgraded to the default tier: its original claim is preserved in a separate field, and a short reason is attached.

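// subject() returns the normalized subject; hasJustification() is the single-source check.
for (const f of findings) {
  // Only elevated claims are graded; lower tiers still count as siblings.
  if (f.severity !== "high" && f.severity !== "critical") continue;
  // A sibling is a different finding, from a different source, about the same subject.
  const siblings = findings.filter(g =>
    g.id      !== f.id      &&
    g.source  !== f.source  &&
    subject(g) === subject(f)
  );
  if (siblings.length >= 1) {
    f.enforced = { severity: f.severity, corroborated_by: siblings.map(s => s.id) };
    continue;
  }
  if (hasJustification(f)) {
    f.enforced = { severity: f.severity, accepted_single_source: true };
    continue;
  }
  // Downgrade to the default tier; f.severity still holds the original claim.
  f.enforced = { severity: "medium", reason: "no corroborating source, no justification" };
}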

The shape of what counts as a sibling depends on the agent. For a production-incident agent, the three independent sources are metrics, logs, and traces, and a "subject" is the service or workload they all talk about. For a coding agent reviewing a pull request, the sources might be the test runner, the type checker, the linter, and a code-search step; a "subject" is a file path or a symbol. For a research agent answering a market question, the sources might be an SEC-filings index, an earnings-call transcript store, and a news archive; a "subject" is a company and a metric. The check is the same in every case (different source, same subject), and it makes a claim that one tool found qualitatively distinct from a claim two independent tools found.

Figure: three findings from different sources about the same subject: two claimed HIGH, one MEDIUM. The two HIGH findings are kept at HIGH, with the sibling IDs of the other findings on the same subject recorded on each. The MEDIUM finding is not itself graded but does count as a sibling for the HIGH ones. A separate finding from a single source about a different subject, with no justification attached, is downgraded to the default tier, with the original claim preserved as a separate field.

The synthesis pass. A claim that three sources agree on becomes a finding the reviewer reads first. A claim only one source backs falls to the default tier — its original wording stays on the record, and the downgrade reason travels with it.

The matching problem, briefly

Two independent sources will rarely refer to the same subject with the same string. A test runner's failure cites a file path that begins at the repository root; a code-search step's hit cites the same file with a workspace-relative prefix; a CI annotation cites it with a build-target qualifier. A metrics tool names a service one way; a logs tool names the same workload with a namespace prefix and a worker suffix; a tracing tool uses the bare name. If the corroboration check compares the raw strings, all three pairs of legitimate corroborators look like findings about different subjects, and nothing is ever graded as corroborated.

The fix is a small normalizer that runs before the comparison. Strip whatever names get mangled by your tooling; pin whatever has multiple representations to one. If two sources cannot be reconciled by a normalizer, give the agent an explicit override field on the finding so it can name the subject directly when its tools disagree. Whatever shape this takes, the principle is the same: the corroboration check has to compare subjects at the level the human would consider equivalent, not at the level the strings happen to land at.
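A sketch of such a normalizer for the service-name case. The formats it strips, a `<namespace>/` prefix and a `-worker-<n>` suffix, are assumptions made for illustration, as is the `subjectOverride` field name; substitute whatever your own tools actually emit.

```javascript
// Hypothetical naming conventions: "<namespace>/<service>" from the logs tool
// and "<service>-worker-<n>" for worker processes.
function normalizeSubject(raw) {
  return raw
    .trim()
    .toLowerCase()
    .replace(/^[^/]+\//, "") // drop a leading namespace prefix
    .replace(/-worker-\d+$/, ""); // drop a trailing worker suffix
}

// An explicit override set by the agent wins over the tool's own string.
function subject(finding) {
  return normalizeSubject(finding.subjectOverride ?? finding.subject);
}
```

With these rules, `prod/checkout-worker-3`, `Checkout`, and `checkout` all compare equal, so findings from the three tools can count as siblings.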

The single-source carve-out

Some claims have only one source, by construction. A custom metric emitted by exactly one exporter, with no log line or trace span that names the same subject, has nothing to corroborate it. A private internal repository has no public document that can cite it. A control-plane component speaks one telemetry channel and no other tool sees it at all. The corroboration rule cannot be absolute, or it would downgrade those claims forever.

The carve-out is small. The finding can carry a justification field — a sentence naming which source would normally back the claim and why no such source exists. The synthesis pass accepts the elevated claim if the justification is present, and the audit trail records that the claim was kept as a single-source case rather than as a corroborated one. Reviewers see the distinction; the report does not pretend a single source is two.

What keeps this from becoming an escape hatch is a length floor on the justification — a minimum number of characters, low enough not to ask anything elaborate of the agent, high enough that a one-character placeholder fails the check. The floor is mechanical, like the rest of the gating: the synthesis pass measures the string, it does not judge it. An agent that tries to bypass the rule with a token of whitespace gets the same downgrade as one that attached no justification at all.
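The length floor could be sketched as follows. The 40-character value is an arbitrary placeholder, not a recommendation, and collapsing whitespace before measuring is an assumption added here so that padding cannot satisfy the floor.

```javascript
// Hypothetical floor; choose a value that fits your domain.
const MIN_JUSTIFICATION_CHARS = 40;

// Measures the justification and never judges its content.
function hasJustification(finding) {
  if (typeof finding.justification !== "string") return false;
  // Collapse runs of whitespace and trim the ends before counting characters.
  const text = finding.justification.replace(/\s+/g, " ").trim();
  return text.length >= MIN_JUSTIFICATION_CHARS;
}
```

A finding with no justification, a placeholder such as `n/a`, or a string made of spaces all fail the same way, which matches the downgrade behaviour described above.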

What the report looks like

By the time the report is rendered, every finding carries two fields the reviewer cares about: the severity the agent originally claimed, and the severity the rubric let it keep. The renderer sorts by the second one. The corroborated top-tier findings appear first, each with the list of siblings that backed them. The single-source-accepted findings come next, each with its justification. The downgraded findings come at the bottom, with the original claim still visible and the downgrade reason attached.
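A sketch of the ordering, assuming the `enforced` shape written by the synthesis pass. The helper names are hypothetical, and the tie-break within a severity (corroborated, then accepted single-source, then never elevated, then downgraded) is one reading of the layout described above, not the only possible one.

```javascript
const SEVERITY_RANK = { critical: 0, high: 1, medium: 2, low: 3 };

// Severity the rubric let the finding keep; ungraded findings keep their claim.
const enforcedSeverity = (f) => f.enforced?.severity ?? f.severity;

// 0 = corroborated, 1 = accepted single-source, 2 = never elevated, 3 = downgraded.
function evidenceGroup(f) {
  if (f.enforced?.corroborated_by) return 0;
  if (f.enforced?.accepted_single_source) return 1;
  if (f.enforced?.reason) return 3;
  return 2;
}

// Sort a copy: enforced severity first, evidence group as the tie-break.
function sortForReport(findings) {
  return [...findings].sort(
    (a, b) =>
      SEVERITY_RANK[enforcedSeverity(a)] - SEVERITY_RANK[enforcedSeverity(b)] ||
      evidenceGroup(a) - evidenceGroup(b)
  );
}

// One line per finding: enforced grade, the original claim when it differs,
// and the downgrade reason when there is one.
function renderLine(f) {
  const kept = enforcedSeverity(f);
  const claim = kept === f.severity ? "" : ` (claimed ${f.severity})`;
  const why = f.enforced?.reason ? ` | ${f.enforced.reason}` : "";
  return `[${kept}${claim}] ${f.subject}: ${f.observation}${why}`;
}
```

Because the original claim is printed whenever it differs from the enforced grade, a downgraded finding stays visibly downgraded instead of blending in with findings that were recorded at the default tier.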

That layout is more useful to a reviewer than a paragraph for two distinct reasons. The first is obvious: it tells them where to look first. The second is less obvious: it tells them where the agent reached. A long list of downgraded findings is itself signal — it says the agent was searching aggressively and not finding much. A short list of corroborated findings says the opposite. Both are honest descriptions of what happened, and both are answers the reviewer can act on without rereading the conversation.

What this doesn't fix

The pattern is narrow, and the narrowness is worth stating up front. It does not make the agent right. An agent can attach a failing test that turns out to be flaky, a citation that turns out to be hallucinated, an entropy score that came from a corrupt input. The gating fields are checked for presence and shape, not for truth. If the field is filled in plausibly, the finding is stored at the claim, and the downstream reviewer is still the one who has to verify the substance.

It does not catch shared upstream bias. Two sources that both inherit a poisoned input — the same broken metric scrape feeding both a dashboard and an alert, the same wrong document being summarised by two retrievers — will corroborate each other, and the synthesis pass will see two independent sources where really there was one. The corroboration rule is structural; the structure does not know which sources share an upstream.

It also does not stop an agent from quietly defaulting to the middle tier. If the gates are too tight, a careful agent learns to claim only the safe tier and the report loses the bright top section that made it useful in the first place. The fix is not to loosen the gates; the fix is to choose gating fields the domain actually has, so a correct top-tier claim is something the agent can routinely back. If the domain has no objective evidence available, this pattern is not the right one for it.

What it buys you

Set against those limits is what the structure gives a reviewer for free. The report is sortable, which means the reviewer's eye starts at the strongest claim instead of in the middle of a paragraph. The downgrades are visible, which means the reviewer can tell when the agent reached for something it could not back. The audit trail carries both the claim and the enforcement, which means a regression in the agent — a tendency to over-claim, a tendency to under-attach evidence — shows up as a measurable shift in the ratios rather than as a quiet drift in the prose. None of this depends on the agent being good at writing; it depends only on the agent being willing to record what it sees and on the gates being applied mechanically afterwards.
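The ratios mentioned above can be computed directly from the graded list. This is a sketch that assumes the `enforced` shape written by the synthesis pass; `gradeRatios` is a hypothetical name, and which ratios are worth tracking will depend on the agent.

```javascript
// Share of elevated claims that were corroborated, accepted on a single
// source, or downgraded. Findings without an `enforced` field were never
// elevated and are left out of the denominator.
function gradeRatios(findings) {
  const elevated = findings.filter((f) => f.enforced);
  const total = elevated.length;
  const share = (n) => (total === 0 ? 0 : n / total);
  const count = (pred) => elevated.filter(pred).length;
  return {
    elevated: total,
    corroborated: share(count((f) => Boolean(f.enforced.corroborated_by))),
    singleSource: share(count((f) => Boolean(f.enforced.accepted_single_source))),
    downgraded: share(count((f) => Boolean(f.enforced.reason))),
  };
}
```

Tracked across runs, a rising `downgraded` share would suggest the agent is over-claiming, while a falling `elevated` count may mean it has retreated to the middle tier.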

The portable part of the pattern is the shape: a finding with a claim, a source, and an optional evidence block; one gate at record time that asks for objective fields to back the top tier; one gate at synthesis time that asks for a different source on the same subject. It sits next to the composition patterns Anthropic catalogues in Building Effective Agents — those describe how to structure an agent's control flow and orchestrate its LLM calls; this one describes how it organises what it found. The domain-specific part is the field list. What counts as gating evidence and what counts as a corroborator are things you know about your problem that no general framework can supply. The structure is what makes the answer rankable. What the answer is, the agent still has to find — and the reviewer still has to read. The point is that they get to read it with grades attached.
