Tree-search agents: building an AI agent that survives a wrong first guess

The cheapest AI debugging session is the one where the first hypothesis is right. You ask the model what is wrong with the service, it guesses, the guess matches the bug, and forty-five seconds later you have a fix. Most of the time, that is not the session you get.

The session you usually get is the one where the model's first instinct lands on a plausible-but-wrong cause. It spends five minutes pulling data to confirm the theory, the data does not confirm it, and the conversation is now full of context relevant only to a dead branch. You back up. The next attempt starts from a worse place than the first one did. The hypothesis was wrong, which is no one's fault. It is the cost of being wrong serially that gives this kind of session its bad reputation.

We work on ApexData, an observability product built for development teams — engineers who write the application code, not the people running the cluster, and who want to know what is happening in production without learning a separate ops toolchain to find out. The piece of the product worth writing about here is the one that takes a live incident — a user complaint, a graph bending in the wrong direction, an alert that fired without context — and explains, in language someone reading application code can act on, what is actually going on. That piece is an AI agent.

This post is about the design moves that helped most when we built it, and why each of them was a response to a specific failure mode in the simpler version of the loop. Most of the patterns below — branching hypotheses, specialised subagents, evidence-scored evaluation, bounded search — are well trodden in the agent literature, applied here to a particular operational shape: real-time investigation across metrics, logs, and traces. We are writing it down because the shape generalises. If you are building an agent in a different domain that has to investigate noisy data and produce a defensible answer, some of these moves should transfer directly. Where the choices are constrained by our domain, we will say so.

The shape of the problem

The standard linear pattern — interleave reasoning and tool calls until you converge, in the spirit of ReAct (Yao et al., 2022) — works well when the hypothesis space is small and the first guess the model lands on is approximately right.

In incident triage the first hypothesis is approximately right less than half the time. There is no shortage of failure modes that produce similar surface symptoms. Slow API responses can come from saturated CPU, an exhausted database connection pool, a noisy neighbour on a shared node, a backpressure spike in an upstream queue, a slow downstream service whose latency is propagating, or a hot index on a single table. The first guess from the model is usually one of these, and the other five typically are not visited at all. Most loops do not pivot; they keep refining the first hypothesis until it works or they run out of budget.

The cost is subtle. Each wrong tool call is a few thousand tokens of irrelevant data getting glued to the context. By the time the model gives up on a wrong hypothesis, the context is full of irrelevant memory metrics, irrelevant traces, irrelevant logs from a service that turned out not to be on the path. Starting over is not free either; some of that material is going to leak into the next attempt.

Two design moves address this directly. The first is to generate hypotheses in parallel, so being wrong on one branch does not invalidate work on another. The second is to give each domain its own specialist agent, so noisy data from one domain does not pollute the context of another. Most of the rest of this article is about applying those two moves carefully.

Figure: chain versus tree. In the chain on the left, a wrong first hypothesis wastes the rest of the run. In the tree on the right, sibling branches are still alive when the planner re-scores.

Tree of hypotheses, not a chain

The investigation runtime starts each session by generating multiple candidate root causes in parallel rather than picking one. Each becomes a branch in a tree, explored independently. If branch A turns out wrong, the work done on B is not invalidated.

The shape is borrowed from Language Agent Tree Search (Zhou et al., 2023; ICML 2024). The original paper applies a Monte Carlo Tree Search variant to language-agent action spaces: at each node the agent expands a set of next moves, scores them by how well they advance the task, and selects which to expand next. Our variant is simpler. The "actions" are hypotheses about the cause of an incident, and the "score" is how well the evidence collected so far supports that hypothesis.

One round of the loop looks like this:

Initialise. A planning agent generates three initial hypotheses from the user's question by default. Each carries a focus field — a short investigation brief in plain English — and a rationale for why this hypothesis is plausible.
Select. From all pending nodes, pick the one with the highest selection score. The textbook UCT (Upper Confidence Bound for Trees) formula is mean_reward + C * sqrt(ln(parent_visits) / visits), where the exploitation term is the mean of rewards seen in the subtree. We use a UCB1-shaped score over a max-aggregated value instead — the exploitation term is the best score seen in the subtree, not the average. That trades UCT's regret guarantees for a best-of-N flavour, which fits incident triage where we only care about the strongest hypothesis, not the average one. Unvisited nodes get a large exploration bonus so every node is investigated at least once before the search revisits.
Investigate. Hand the node's focus to the orchestrator subagent. It runs metrics, logs, or traces queries — whatever the hypothesis calls for — and records findings into a shared store as it goes.
Evaluate. Hand the findings back to the planning agent and ask it to score how well they support the hypothesis. The output is a structured object: { score, reasoning, keyEvidence, gaps }. The gaps field, in particular, is what drives the next step.
Backpropagate. Walk up the tree, updating each ancestor's score to the maximum of its completed children rather than the mean. This is the deviation from textbook UCT mentioned above; if a sibling later beats this branch, the parent's score moves up to track it. Update the global "best" pointer if this run beat the previous leader.
Expand. If the score is above 0.3 and we have not hit max depth, the planning agent generates child hypotheses that dig deeper into the branch. The siblings already explored at this level are passed in as context to discourage redundant expansions.
Stop. If the best score crosses the high-confidence threshold (0.85 in our setup), the run ends early. Otherwise it continues until the budget is exhausted (six rounds default) or every node has been visited.
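
The Select and Backpropagate steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the product's implementation: the node shape, the exploration constant of 1.4, and the size of the unvisited bonus are all hypothetical. It shows the two properties the text describes: the exploitation term is the best score in the subtree, and ancestors are updated with a max rather than a mean.

```typescript
// Hypothetical node shape; field names are illustrative, not ApexData's schema.
interface HypothesisNode {
  id: string;
  parent: HypothesisNode | null;
  visits: number;        // times this node or a descendant has been evaluated
  bestInSubtree: number; // max-aggregated value: best evidence score seen below here
}

const EXPLORATION_C = 1.4;   // assumed exploration constant; re-tune per domain
const UNVISITED_BONUS = 1e6; // large enough that unvisited nodes always go first

function makeNode(id: string, parent: HypothesisNode | null = null): HypothesisNode {
  return { id, parent, visits: 0, bestInSubtree: 0 };
}

// UCB1-shaped score: best value in the subtree plus an exploration term.
function selectionScore(node: HypothesisNode): number {
  if (node.visits === 0) return UNVISITED_BONUS;
  const parentVisits = Math.max(node.parent ? node.parent.visits : 1, 1);
  const exploration = EXPLORATION_C * Math.sqrt(Math.log(parentVisits) / node.visits);
  return node.bestInSubtree + exploration;
}

// Pick the candidate with the highest selection score (the first one wins ties).
function selectNext(candidates: HypothesisNode[]): HypothesisNode | undefined {
  let best: HypothesisNode | undefined;
  for (const n of candidates) {
    if (best === undefined || selectionScore(n) > selectionScore(best)) best = n;
  }
  return best;
}

// Record an evaluation score, then push the max (not the mean) up the ancestors.
function backpropagate(node: HypothesisNode, score: number): void {
  node.visits += 1;
  node.bestInSubtree = Math.max(node.bestInSubtree, score);
  for (let a = node.parent; a !== null; a = a.parent) {
    a.visits += 1;
    a.bestInSubtree = Math.max(a.bestInSubtree, node.bestInSubtree);
  }
}
```

With two evaluated siblings scoring 0.4 and 0.7, the parent's value becomes 0.7 rather than 0.55, and a third sibling that has not been visited yet is still selected ahead of both.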

A short note on the constants in here: 0.3, 0.85, six, ten, fifteen (the last two are the per-subagent limits described under "Bounded by design"). These are values we landed on after a small number of internal sweeps on our own incident corpus. They are not derived from first principles, and we would not expect them to be portable to a different domain or a different model. Anyone copying the structure should expect to re-tune the gates against their own evals. With that caveat, a few of the choices look minor on their own but compound.

The 0.3 expansion gate matters. Without it, every hypothesis spawns children regardless of how badly it underperformed, which means the search ends up exploring siblings of confirmed dead ends. With it, the budget concentrates on hypotheses that already showed traction.
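
The expansion gate and the stop check from the loop can be written as two small predicates. The sketch below uses the thresholds quoted in the text (0.3, 0.85, six rounds); the maximum depth is not stated in the post, so the value of 3 here is a placeholder, and the names are hypothetical.

```typescript
interface SearchGates {
  expandThreshold: number; // expand only branches that showed traction
  stopThreshold: number;   // high-confidence early exit
  maxRounds: number;       // search budget
  maxDepth: number;        // placeholder: the post does not give this value
}

const DEFAULT_GATES: SearchGates = {
  expandThreshold: 0.3,
  stopThreshold: 0.85,
  maxRounds: 6,
  maxDepth: 3,
};

// A node spawns children only if it scored above the gate and is not at max depth.
function shouldExpand(score: number, depth: number, gates: SearchGates = DEFAULT_GATES): boolean {
  return score > gates.expandThreshold && depth < gates.maxDepth;
}

// The run ends on high confidence, an exhausted budget, or no nodes left to visit.
function shouldStop(
  bestScore: number,
  roundsUsed: number,
  pendingCount: number,
  gates: SearchGates = DEFAULT_GATES,
): boolean {
  return (
    bestScore >= gates.stopThreshold ||
    roundsUsed >= gates.maxRounds ||
    pendingCount === 0
  );
}
```

Keeping the gates in one config object makes them easy to sweep against an eval set, which matters given that the values are not expected to transfer across domains or models.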

Sibling-aware expansion matters too. A planning agent that does not know what its siblings explored will generate variations of the same direction. Passing in the existing sibling hypotheses reliably broadens the spread of children, even at the same temperature.

The exploration term in the selection score is the part most easily skipped. Without it, search collapses onto whichever node looked best on round one, and the tree degenerates into a chain again. The exploration bonus keeps low-scored nodes alive long enough to get a second look, which is the whole reason for going to a tree in the first place.

One specialist per domain, not a generalist

Each branch's investigation is run by an orchestrator that delegates to three specialised subagents:

Metrics agent. PromQL against VictoriaMetrics. Knows the RED method (rate, errors, duration) for service health and the USE method (utilisation, saturation, errors) for resources. Knows that counters need rate() to be meaningful, that histogram quantiles need histogram_quantile() over a rate() of the bucket series, and that gauges can be queried directly. Loads a metrics catalogue skill on every run before doing anything else.
Logs agent. Full-text and regex search across ClickHouse, with routing to Manticore for recent windows. Knows the OpenTelemetry log schema columns and the difference between search_logs (text search) and grep_logs (regex).
Traces agent. Span queries for latency outliers, error spans, and slow spans. Knows how to walk a trace and how to aggregate spans without missing the long tail.

The reason for the split is not that one model could not handle all three with the right prompt. It is that the noise profiles are very different, the tools you need to query each are very different, and an agent that is good at one tends to be mediocre at the others. Anthropic's 2025 writeup of its multi-agent research system makes the same point: subagents with their own context windows and their own tool sets reduce context pollution and let each one carry domain-tailored expertise.

The concrete benefit at our end shows up on a few axes:

The metrics agent's prompt can spend its tokens documenting label conventions and metric type rules, which would be wasted on a logs question.
A query that times out or returns empty in the logs agent does not fill up the metrics agent's context with garbage.
When the orchestrator dispatches all three subagents in parallel, wall-clock cost is the slowest agent rather than the sum, which is up to a 3× speedup in the limit. The orchestrator waits for all three to record findings — the shared store is where synthesis happens, not the first message back — but the work proceeds concurrently.
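
The parallel dispatch described in the last point can be sketched with Promise.all. The subagent signature and type names below are hypothetical stand-ins; the point is only that the orchestrator awaits all three and then reads from the store, so elapsed time is bounded by the slowest subagent.

```typescript
type SignalDomain = "metrics" | "logs" | "traces";

interface BranchFinding {
  agent: SignalDomain;
  query: string;
  result: string;
}

// Hypothetical subagent signature: receives the branch focus and a record callback.
type Subagent = (focus: string, record: (f: BranchFinding) => void) => Promise<void>;

async function investigateBranch(
  focus: string,
  agents: Record<SignalDomain, Subagent>,
): Promise<BranchFinding[]> {
  const store: BranchFinding[] = [];
  const record = (f: BranchFinding): void => {
    store.push(f);
  };
  const domains: SignalDomain[] = ["metrics", "logs", "traces"];
  // All three start immediately; the await resolves when the slowest one finishes.
  await Promise.all(domains.map((d) => agents[d](focus, record)));
  // Synthesis reads the store, not whichever subagent replied first.
  return store;
}
```

Note that Promise.all rejects as soon as any subagent throws. Whether one failed domain should sink the whole branch is a design choice this sketch does not settle.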

One design pitfall is worth flagging. The subagents have to share findings, otherwise they cannot cross-reference each other's evidence. We use a shared findings store keyed by [investigationId, 'findings'] that every subagent writes to via a record_finding tool. A finding is an append-only record: which agent recorded it, what the query was, what the result was, what time window it covered, what unit it was in. The orchestrator and the planning agent read from this store when synthesising. Without that shared substrate, "evidence" is just whatever happens to be in the conversation at synthesis time, and the synthesis prompt ends up reconstructing facts from prose.
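
A minimal in-memory sketch of such a store follows. The record fields mirror the list in the paragraph above (agent, query, result, time window, unit); the class, its method names, and the in-memory Map are assumptions for illustration, since the post does not describe the real storage layer.

```typescript
type AgentName = "metrics" | "logs" | "traces";

interface FindingRecord {
  readonly agent: AgentName;
  readonly query: string;
  readonly result: string;
  readonly timeWindow: { from: string; to: string }; // ISO-8601 timestamps
  readonly unit: string;
  readonly recordedAt: string;
}

class FindingsStore {
  private readonly byKey = new Map<string, FindingRecord[]>();

  // Mirrors the [investigationId, 'findings'] key shape from the text.
  private static key(investigationId: string): string {
    return JSON.stringify([investigationId, "findings"]);
  }

  // Append-only: there is deliberately no update or delete method.
  recordFinding(
    investigationId: string,
    finding: Omit<FindingRecord, "recordedAt">,
    now: Date = new Date(),
  ): FindingRecord {
    const rec: FindingRecord = Object.freeze({ ...finding, recordedAt: now.toISOString() });
    const k = FindingsStore.key(investigationId);
    const list = this.byKey.get(k) ?? [];
    list.push(rec);
    this.byKey.set(k, list);
    return rec;
  }

  // Returns a copy so callers cannot mutate the stored sequence.
  list(investigationId: string): readonly FindingRecord[] {
    return [...(this.byKey.get(FindingsStore.key(investigationId)) ?? [])];
  }
}
```

Because every record carries its own query, window, and unit, a synthesis step can cite evidence directly instead of parsing it back out of conversation text.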

Figure: three specialists, each with its own runbook skills, writing into one shared findings store.

Skills, not system-prompt walls

The metrics agent does not know about every metric in our catalogue from prompt alone. It loads a metrics-catalog skill on every run — a markdown file mounted into a virtual filesystem under /skills/ — that documents metric families, types, label conventions, and a dynamic-prefix system for client-discovered metrics. Other skills cover service health, resource usage, anomaly detection, database health, kubernetes basics, and a PromQL reference.

Skills, as a public pattern, were articulated more explicitly in Anthropic's late-2025 writeup: structured folders of instructions and scripts that compose into agent capability without inflating the base prompt. The mechanic is progressive disclosure. The agent loads the skills relevant to the task it is on, not every skill the system knows about.

The concrete advantage is keeping the prompt short enough to think in. The metrics agent's base prompt stays at a couple of hundred lines because the skills carry the bulk of the domain knowledge. Adding a new metric family is a markdown commit, not a prompt edit, and the prompt regressions that come from trying to cram new rules into an already-large system message stop happening.

Cross-pollination is intentionally limited. The metrics agent does not see the logs runbooks, and vice versa. The fewer paths a subagent has into knowledge that is not its own, the easier it is to keep its behaviour consistent across runs.
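
One way to express both ideas, progressive disclosure and limited cross-pollination, is a per-agent allowlist over the skill files. In the sketch below only the metrics-catalog skill name comes from the text; the other file names, the in-memory map standing in for the virtual filesystem, and the loadSkills function are invented for illustration.

```typescript
type SkillAgent = "metrics" | "logs";

// Hypothetical stand-in for the virtual filesystem mounted under /skills/.
const SKILL_FILES: Record<string, string> = {
  "/skills/metrics-catalog.md": "Metric families, types, label conventions.",
  "/skills/promql-reference.md": "PromQL functions and common patterns.",
  "/skills/log-schema.md": "OpenTelemetry log schema columns.",
};

// Each agent can only read its own skills.
const SKILL_ALLOWLIST: Record<SkillAgent, string[]> = {
  metrics: ["/skills/metrics-catalog.md", "/skills/promql-reference.md"],
  logs: ["/skills/log-schema.md"],
};

// Skills loaded on every run, before anything task-specific.
const ALWAYS_LOAD: Record<SkillAgent, string[]> = {
  metrics: ["/skills/metrics-catalog.md"],
  logs: [],
};

// Returns the skill bodies to add to the agent's context for this task.
function loadSkills(agent: SkillAgent, requested: string[]): string[] {
  const allowed = new Set(SKILL_ALLOWLIST[agent]);
  const seen = new Set<string>();
  const loaded: string[] = [];
  for (const path of [...ALWAYS_LOAD[agent], ...requested]) {
    if (seen.has(path)) continue; // load each skill at most once
    seen.add(path);
    if (!allowed.has(path)) throw new Error(`${agent} agent may not load ${path}`);
    const body = SKILL_FILES[path];
    if (body === undefined) throw new Error(`missing skill file ${path}`);
    loaded.push(body);
  }
  return loaded;
}
```

Failing loudly on a disallowed path, rather than silently skipping it, makes accidental cross-domain reads visible during development.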

Bounded by design

The LATS budget is six nodes by default. We have run with higher numbers. We have not measured this rigorously, but in our internal sweeps doubling the budget from six to twelve nodes did not visibly improve final-answer quality on our eval set, while it roughly doubled token cost and wall-clock time. The trade was not worth it for us. The quality curve appeared to flatten well before ten or so nodes, so we did not invest in tuning further.

A runaway investigation is expensive in three ways:

Tokens. Each branch is a full subagent run, often tens of thousands of tokens in tool calls, evidence storage, and evaluation overhead.
Wall clock. A twelve-round investigation that occasionally lands on a better answer than a six-round one is, in practice, an investigation telling the on-call engineer "we are still thinking" while their pager is firing.
Cumulative confusion. The longer a tree gets, the more chances there are for the planning agent to start expanding marginal branches just because they exist. Past a certain depth, the new hypotheses are mostly variations of variations.

The accuracy curve hits diminishing returns earlier than people guess. You can also stop a run yourself at any point — an AbortSignal is wired through the engine, the orchestrator, and the LangGraph stream, so cancellation propagates immediately.
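
The cancellation path can be illustrated with a single AbortSignal handed down through every layer. The sketch below is a simplification: it is synchronous, checks the signal cooperatively at step boundaries, and uses hypothetical function names. A real engine would also pass the signal into network calls and the streaming layer.

```typescript
// Inner layer: a subagent's tool loop checks the signal before each step.
function runSubagentSteps(steps: number, signal: AbortSignal, onStep: () => void): number {
  let done = 0;
  for (let i = 0; i < steps; i++) {
    if (signal.aborted) break;
    onStep();
    done++;
  }
  return done;
}

// Outer layer: the search loop passes the same signal down and checks it per round.
function runSearch(
  rounds: number,
  stepsPerRound: number,
  signal: AbortSignal,
  onStep: () => void,
): { roundsCompleted: number; stepsCompleted: number } {
  let roundsCompleted = 0;
  let stepsCompleted = 0;
  for (let r = 0; r < rounds; r++) {
    if (signal.aborted) break;
    stepsCompleted += runSubagentSteps(stepsPerRound, signal, onStep);
    if (signal.aborted) break; // an interrupted round is not counted as completed
    roundsCompleted++;
  }
  return { roundsCompleted, stepsCompleted };
}

// Usage: the caller keeps the controller and aborts whenever it wants.
const controller = new AbortController();
let stepCount = 0;
const summary = runSearch(6, 3, controller.signal, () => {
  stepCount++;
  if (stepCount === 5) controller.abort(); // simulate the user pressing stop
});
```

Because both layers observe the same signal, aborting during the second round stops the inner loop at its next step and the outer loop without starting another round.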

Inside a single subagent run, there is a separate budget enforced by what we call the convergence middleware. Its job is to stop one subagent from running forever inside one node:

After ten model calls in a single run, the middleware injects a synthetic human message: "You have explored enough. Record any missing findings and wrap up."
After fifteen, it forces an end and writes a final assistant message explaining the run was cut short for budget reasons.

The synthetic-human framing is deliberate. A system message at this point reads as a constraint the model can rationalise around; a human message reads as an instruction it has to comply with. In informal testing the human-message variant wrapped up cleanly more often than the system-message version, though we have not run a controlled evaluation.
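
The two thresholds can be sketched as a hook that runs before each model call. The message shape, function name, and wording of the forced final message are assumptions; the thresholds and the nudge text come from the description above.

```typescript
type Msg = { role: "system" | "human" | "assistant"; content: string };

const NUDGE_AFTER_CALLS = 10;
const HARD_STOP_AFTER_CALLS = 15;
const NUDGE_TEXT = "You have explored enough. Record any missing findings and wrap up.";

interface ConvergenceState {
  modelCalls: number; // model calls already made in this subagent run
  nudged: boolean;    // whether the wrap-up message has been injected
}

type HookResult =
  | { action: "continue"; messages: Msg[] }
  | { action: "end"; final: Msg };

// Call before every model call. The caller should carry the returned messages forward.
function beforeModelCall(state: ConvergenceState, messages: Msg[]): HookResult {
  if (state.modelCalls >= HARD_STOP_AFTER_CALLS) {
    return {
      action: "end",
      final: { role: "assistant", content: "Run cut short: model-call budget exhausted." },
    };
  }
  let out = messages;
  if (state.modelCalls >= NUDGE_AFTER_CALLS && !state.nudged) {
    state.nudged = true;
    // Injected as a human message, not a system message, per the reasoning above.
    out = [...messages, { role: "human", content: NUDGE_TEXT }];
  }
  state.modelCalls += 1;
  return { action: "continue", messages: out };
}
```

In this sketch the nudge is attached to the eleventh call and the sixteenth call is refused, which matches "after ten" and "after fifteen" model calls.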

Confidence scored from evidence

Every hypothesis gets a score based on what the agents actually found. This sounds obvious, but it is the place a lot of agentic systems quietly conflate two different quantities. Plausibility of the hypothesis (how reasonable it sounded a priori) and weight of the evidence (how much the data confirms it) are different things. Conflating them is how an agent ends up doubling down on its first guess.

The evaluation prompt asks for four things, in JSON:

{
  "score": 0.62,
  "reasoning": "...",
  "keyEvidence": ["...", "..."],
  "gaps": ["...", "..."]
}

The score and reasoning are what they look like. keyEvidence is the explicit list of findings the planning agent relied on for the score. gaps is what would still need to be confirmed for the score to go higher. Scores update live during the run, with a "best" badge on the current leader, and the leader can change. A hypothesis that scored 0.7 on round one can drop to 0.4 on round three when a sibling turns up evidence that contradicts it.

The gaps field has a second job. When a node is expanded, the planning agent uses gaps to brief the children — "here is what we still need to know to confirm or refute this branch." Children frequently end up running queries that directly fill the gaps their parent left, instead of redoing the parent's work or speculating in a parallel direction.
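
A sketch of how the evaluation object might be validated and how gaps could be turned into a brief for child hypotheses. The [0, 1] score range is an assumption inferred from the thresholds quoted earlier, and parseEvaluation and childBrief are hypothetical names.

```typescript
interface Evaluation {
  score: number;        // assumed to lie in [0, 1]
  reasoning: string;
  keyEvidence: string[];
  gaps: string[];
}

function isStringArray(x: unknown): x is string[] {
  return Array.isArray(x) && x.every((s) => typeof s === "string");
}

// Parse and validate the model's JSON reply; throw rather than accept a malformed score.
function parseEvaluation(raw: string): Evaluation {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("evaluation must be a JSON object");
  }
  const { score, reasoning, keyEvidence, gaps } = parsed as Record<string, unknown>;
  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new Error("score must be a number in [0, 1]");
  }
  if (typeof reasoning !== "string") throw new Error("reasoning must be a string");
  if (!isStringArray(keyEvidence)) throw new Error("keyEvidence must be a string array");
  if (!isStringArray(gaps)) throw new Error("gaps must be a string array");
  return { score, reasoning, keyEvidence, gaps };
}

// Turn the parent's open gaps into the brief handed to the planner at expansion time.
function childBrief(parentHypothesis: string, evaluation: Evaluation): string {
  const gapLines = evaluation.gaps.map((g) => `- ${g}`).join("\n");
  return (
    `Parent hypothesis: ${parentHypothesis}\n` +
    `Still needed to confirm or refute this branch:\n${gapLines}`
  );
}
```

Validating the reply before it reaches the tree keeps a malformed or out-of-range score from silently steering selection.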

Figure: the loop. Every iteration starts from the pending node with the highest selection score and ends at a stop check, not at 'the model decides to stop'.

Configurable models

A separate settings tab lets you choose which models power investigations, with separate picks for thinking and execution work across Anthropic, Google, OpenAI, and OpenRouter, and automatic fallback if the primary provider rejects a request.

The split is not just cost. Subagents run tools in tight loops, and a faster execution model produces a different agent — one that explores more aggressively in the same wall-clock budget. The thinking model only runs at decision points (initial planning, evaluation, expansion, synthesis), where careful reasoning matters more than throughput.

The fallback layer is the part of this most people get subtly wrong. Wrapping invoke() alone is not enough. Tool-calling agents call bindTools() first to register their tools, then invoke through the bound model. Structured-output chains call withStructuredOutput() to attach a schema. If your fallback only wraps invoke(), the bound and structured paths bypass it entirely. We override both — bindTools() and withStructuredOutput() are patched on the primary model so that whichever path the call takes, the fallback chain exists. If the primary throws a retryable error, the patched method retries on the fallback model.
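
The shape of the problem can be shown with a small stand-in interface. The ModelLike type below is deliberately not the real LangChain type, and this sketch wraps the model rather than patching methods in place as the text describes; the retryable flag on errors is also an assumption. What it illustrates is that the models returned by bindTools() and withStructuredOutput() must themselves carry the fallback.

```typescript
// Minimal stand-in for a chat model. Hypothetical: real library types differ.
interface ModelLike {
  invoke(input: string): Promise<string>;
  bindTools(tools: string[]): ModelLike;
  withStructuredOutput(schema: object): ModelLike;
}

// Assumption: retryable failures are tagged with a boolean `retryable` property.
function isRetryable(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === true
  );
}

function withFallback(primary: ModelLike, fallback: ModelLike): ModelLike {
  return {
    async invoke(input) {
      try {
        return await primary.invoke(input);
      } catch (err) {
        if (!isRetryable(err)) throw err; // non-retryable errors surface unchanged
        return fallback.invoke(input);
      }
    },
    // Derived models are wrapped again, so bound and structured calls keep the fallback.
    bindTools: (tools) =>
      withFallback(primary.bindTools(tools), fallback.bindTools(tools)),
    withStructuredOutput: (schema) =>
      withFallback(primary.withStructuredOutput(schema), fallback.withStructuredOutput(schema)),
  };
}
```

A wrapper that implemented only invoke() and forwarded bindTools() straight to the primary would pass single-provider tests and still lose the fallback on every tool-calling path.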

The fallback is intentionally cross-provider. Anthropic falls back to Google, Google to Anthropic, OpenAI to Anthropic. A retry on the same provider is not a useful fallback if the failure is a provider-wide incident. OpenRouter is itself a multi-provider aggregator, but the failures we have actually seen on it have been OpenRouter-side rather than upstream provider-side, so we fall it back to direct Anthropic — a different network path and credential pool, not a different model family.
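
The routing described above fits in a small lookup table. The provider identifiers and the fallbackFor helper are hypothetical; the pairings are the ones listed in the paragraph.

```typescript
type Provider = "anthropic" | "google" | "openai" | "openrouter";

// Every provider falls back to a different provider, never to itself.
const FALLBACK_PROVIDER: Record<Provider, Provider> = {
  anthropic: "google",
  google: "anthropic",
  openai: "anthropic",
  openrouter: "anthropic", // direct Anthropic: different network path and credentials
};

function fallbackFor(primary: Provider): Provider {
  const target = FALLBACK_PROVIDER[primary];
  if (target === primary) {
    throw new Error(`fallback for ${primary} must be a different provider`);
  }
  return target;
}
```

The self-reference check is cheap insurance against a later config edit that quietly reintroduces a same-provider retry.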

What the output looks like

A run ends with a synthesised explanation backed by the specific metric values and log lines the agents pulled: concrete observations rather than a generic "could be memory pressure, consider investigating further."

The synthesis is itself an LLM call, but it does not have a free hand. It receives:

The user query.
The full hypothesis path from root to best node.
Every finding the subagents recorded along that path, with timestamps, exact queries, and time windows.
The evaluation reasoning at each step.

The prompt asks for a structured incident narrative: summary, timeline, evidence. The "evidence" section is not summarised prose — it cites the exact query that produced the value, the exact log line excerpts the logs agent pulled, the trace IDs the traces agent flagged. If a reader wants to verify something the agent said, they can re-run the query.
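
A sketch of how the synthesis input might be assembled so that evidence stays citable. The type names, field names, and prompt wording are assumptions for illustration; the four inputs and the three narrative sections are the ones listed above.

```typescript
interface PathStep {
  hypothesis: string;
  score: number;
  reasoning: string; // evaluation reasoning recorded at this step
}

interface EvidenceItem {
  agent: "metrics" | "logs" | "traces";
  query: string;      // the exact query, so a reader can re-run it
  result: string;
  windowFrom: string; // ISO-8601
  windowTo: string;   // ISO-8601
}

interface SynthesisInput {
  userQuery: string;
  path: PathStep[];         // root to best node
  findings: EvidenceItem[]; // findings recorded along that path
}

// One numbered entry per finding, citing the query verbatim rather than paraphrasing it.
function renderEvidence(findings: EvidenceItem[]): string {
  return findings
    .map(
      (f, i) =>
        `[${i + 1}] ${f.agent} | ${f.windowFrom} to ${f.windowTo}\n` +
        `    query: ${f.query}\n` +
        `    result: ${f.result}`,
    )
    .join("\n");
}

function buildSynthesisPrompt(input: SynthesisInput): string {
  const path = input.path
    .map((s, i) => `${i + 1}. ${s.hypothesis} (score ${s.score.toFixed(2)}): ${s.reasoning}`)
    .join("\n");
  return [
    `User query: ${input.userQuery}`,
    `Hypothesis path (root to best node):\n${path}`,
    `Recorded findings (cite these verbatim):\n${renderEvidence(input.findings)}`,
    "Write a structured incident narrative with sections: summary, timeline, evidence.",
  ].join("\n\n");
}
```

Numbering the findings gives the synthesis model a stable handle to cite, and gives a reader a direct path from any claim back to the query that produced it.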

One related decision worth calling out: the prompt explicitly tells the synthesis model not to suggest fixes or owner assignments. The output is a data-backed report, not an action plan. Recommendations bias the reader toward whichever fix the model thought of first; in our experience that is exactly the kind of confidence the model should not be projecting at this stage.

Lessons for engineers building investigation agents

Here are the moves that compounded most for us, ranked roughly by how much each paid off relative to how much it cost to implement:

Generate hypotheses in parallel and let evidence pick the winner, not plausibility. This is the single biggest behavioural change between a linear, single-chain agent and a tree-search agent. It costs roughly two to three times more tokens on the easy cases, because you explored branches you did not need to. It saves the hour-long sessions on the hard ones, because the right branch was already started while the wrong ones were running. The tail of session lengths matters more than the median; the trade is favourable.
Score on evidence, not on prior plausibility, and re-score every round. The scoring step is what stops the planner from doubling down. Returning gaps as part of the score is what gives the next round of expansion something specific to do.
Make subagents domain specialists with their own skills. Different noise profiles, different tools, different prompts. A logs question on the metrics agent's context is exactly the kind of waste that compounds badly across a long run. Skills (or any progressive-disclosure mechanism) are how you keep each specialist's prompt small enough to reason in.
Bound everything. Per-search budget, per-subagent budget, per-tool retry. The rare case where the agent would have produced a better answer with more iterations is much rarer than the common case where it would have just produced a more confused one. Inject a wrap-up signal as a synthetic human message before the hard cap, not as a system message.
Apply the fallback at bindTools() and withStructuredOutput(), not only at invoke(). If your fallback only wraps the top-level call, the structured-output and tool-binding paths will silently skip it. This one is easy to miss because most code paths look correct in single-provider testing.
Treat findings as a typed, append-only artifact. A shared findings store is what stops the synthesis prompt from having to reconstruct evidence out of free-form conversation logs. It looks like a small detail until you have to write the synthesis prompt; then it looks like the only sensible way to have done it.

The smallest lesson worth keeping from any of this: in a real incident, you do not know which hypothesis is correct, and you cannot afford to be wrong serially. Build the loop around that, and most of the rest of the design follows.