Release: 2026-03-17 - Around a slow query: notes from the March 17 release

Elena Kuznetsova, Head of Engineering,

Around a slow query — March 17 release: concurrent requests, offline EXPLAIN, deduplicated logs

A slow query, opened in the platform, used to answer a narrow question: how long did this statement take, and how often. Useful, but rarely enough. The next question is almost always something that the slow-query view did not know how to answer — what else was running on that pod when the query ran, what plan Postgres would have picked, whether this query is slow on every pod or just this one. The March release was largely about closing those follow-up questions inside the same page, so that the path from "this query is slow" to a working hypothesis does not require leaving the platform.

The release also includes a set of changes to the agent that are less visible from a user's seat but are part of the same story. The agent now deduplicates repeated log lines during high-volume incidents, isolates spans with malformed bytes so a single bad payload no longer disturbs a whole batch, and ships HTTP request profiling on the same pipeline as logs and traces. Compatibility and stability improvements landed in the same pass.

What else was running when the query ran

The query detail page now shows an HTTP/RPC requests table — the set of requests handled by the same pod that were in flight while the slow query was executing. The intent is straightforward: when a database call is unexpectedly slow, the question you usually want answered is whether the rest of the pod was busy at the same time, and which request paths were active. That answer used to require pivoting to a different view with a different time selector. It now lives directly under the query.

The interesting design choice in this table is how it decides which requests are relevant. A pod busy enough to have a slow query handles many requests in any reasonable time window, and most have nothing to do with the query in question. The matcher uses interval overlap: a request is shown only if it was in flight while the slow query was executing, which is the definition that matches the colloquial sense of "what else was running." On busy pods, that typically narrows the list to tens of requests rather than hundreds, so the table stays legible even on hot paths.

A slow query interval shown as a red bar at the top, with dashed vertical guides marking its start and end. Below, request bars are matched only if they overlap the query's interval; bars falling entirely before or after the query are dimmed.
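
The overlap rule can be sketched in a few lines of Python. This is a minimal illustration, not the platform's matcher: the Interval type, the function names, the sample request paths, and the choice to treat endpoints as inclusive are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float  # seconds on a clock shared by the query and the requests
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    # Two intervals overlap when each one starts before the other ends.
    # Inclusive endpoints are an assumption of this sketch.
    return a.start <= b.end and b.start <= a.end


def concurrent_requests(query: Interval, requests: dict[str, Interval]) -> list[str]:
    # Keep only the requests that were in flight while the query was executing.
    return [name for name, span in requests.items() if overlaps(query, span)]


# Hypothetical data: one slow query and four requests on the same pod.
query = Interval(10.0, 12.5)
requests = {
    "GET /cart": Interval(9.0, 10.2),        # still running when the query starts
    "POST /checkout": Interval(11.0, 11.4),  # entirely inside the query interval
    "GET /health": Interval(7.0, 8.0),       # finished before the query started
    "GET /search": Interval(12.6, 13.0),     # started after the query ended
}
matched = concurrent_requests(query, requests)
print(matched)
```

Only the first two requests are matched; the other two correspond to the dimmed bars in the figure.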

One caveat worth flagging: this only tells you what was in flight on the same pod, not what was upstream of the slow query in particular. Two requests can be running at the same time without being causally related. The table is a starting point for a hypothesis ("was this query slow because the pod was saturated handling these other requests?"), not a conclusion. Anything stronger needs the trace.

Two smaller improvements landed in the same view. The Slowest Queries table is now sortable by any column (time, duration, service, namespace, deployment, pod, host, node), defaulting to duration descending. The cross-pod latency chart now has a toggle between "this query" and "all queries": the same view either compares this specific statement's latency across pods or compares all SQL activity. The point is to answer "is this query slow everywhere, or only on this pod" without leaving the page.

Reading the plan before running it

The release adds a SQL Explain step on the query detail page. Click it on a slow query, and the platform reports what Postgres would likely do with that statement — which tables it would read, whether it would use an index or a sequential scan, which index, and how many rows it would expect. It also flags a few common patterns that defeat indexes: a sequential scan on a large table, a function wrapped around an indexed column, a predicate column with no index at all.

The detail that makes this less obvious than it sounds is that the platform does not run EXPLAIN against the production database to produce this. There is no connection back from the platform to your Postgres at query time. Instead, the platform uses the schema and statistics that the agent has already been collecting — table sizes, index definitions, column statistics, value distributions — and parses the query using PostgreSQL's actual parser, exposed as a portable C library via libpg_query, which vendors the server's parser source code so it can run outside a database process. With the parse tree and the collected statistics in the same place, the platform can predict the plan choices for each table independently of the running database.

Query text and collected schema feed into a static analyzer that produces predicted scans and warnings.

Predicting a plan offline is not the same as running EXPLAIN, and it is worth being explicit about the difference. The real planner has live statistics, current cache contents, and a cost model that this analyzer does not reproduce in full. What the analyzer gets right is the structural part: which columns are involved, which indexes exist on those columns, whether the query is shaped in a way that would let Postgres use them. That is the part of the plan that most "why is this query slow" investigations are actually trying to recover. It also catches a class of issue you would otherwise have to notice yourself — a date function wrapping an indexed timestamp column, a comparison against a column that has no index, an OR where one side has no usable index and forces a sequential scan that the indexed side could have avoided — and points at it rather than burying it inside a longer plan.
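
To make the structural checks concrete, here is a toy version in Python. It is a sketch under heavy assumptions and not the analyzer itself: predicates are handed in already extracted (the real system derives them from the libpg_query parse tree), they are treated as a flat AND, and the Predicate type, the analyze function, and the 100,000-row "large table" threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    table: str
    column: str
    wrapped_in_function: bool = False  # e.g. date(created_at) = '2026-03-17'


def analyze(predicates, indexed_columns, row_counts, large_table_rows=100_000):
    """Return (predicted scan per table, warnings) for a flat AND of predicates."""
    scans, warnings = {}, []
    for p in predicates:
        has_index = (p.table, p.column) in indexed_columns
        if has_index and not p.wrapped_in_function:
            scans[p.table] = "index scan"  # one usable index is enough in this toy model
        else:
            scans.setdefault(p.table, "sequential scan")
        if has_index and p.wrapped_in_function:
            warnings.append(f"function wraps indexed column {p.table}.{p.column}")
        elif not has_index:
            warnings.append(f"no index on predicate column {p.table}.{p.column}")
    for table, scan in scans.items():
        if scan == "sequential scan" and row_counts.get(table, 0) >= large_table_rows:
            warnings.append(f"sequential scan on large table {table}")
    return scans, warnings


# Hypothetical schema and statistics, standing in for what the agent collects.
predicates = [
    Predicate("orders", "created_at", wrapped_in_function=True),
    Predicate("users", "id"),
]
indexed_columns = {("orders", "created_at"), ("users", "id")}
row_counts = {"orders": 5_000_000, "users": 20_000}

scans, warnings = analyze(predicates, indexed_columns, row_counts)
print(scans)
print(warnings)
```

The sketch deliberately has no cost model, which mirrors the limitation described above: it can say that an index exists and is usable, not whether the real planner would judge it cheaper.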

Log storms, deduplicated

The agent now deduplicates repeated log lines within the same container. Two lines are treated as the same when their pattern matches — variable parts like timestamps, request IDs, durations, and per-request identifiers are ignored, and the structural skeleton is compared. The agent forwards a single entry with a count of how many times the pattern repeated in the window, instead of forwarding the line again and again. The same treatment applies to Kubernetes events on the cluster watcher.
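
One way to picture the pattern matching is a masking pass followed by a count per skeleton. The sketch below is an assumption about the general technique, not the agent's implementation: the regular expressions, the placeholder tokens, and the Deduped record are invented for illustration, and a production matcher would need to handle many more variable shapes.

```python
import re
from dataclasses import dataclass

# Order matters: mask the most specific shapes first, plain numbers last.
_MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<ts>"),
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d+(?:\.\d+)?(?:ms|s)?\b"), "<num>"),
]


@dataclass
class Deduped:
    sample: str        # first raw line seen for this pattern
    count: int
    first_seen: float
    last_seen: float


def skeleton(line: str) -> str:
    # Replace variable parts with tokens so only the structure is compared.
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def dedupe(window):
    """Collapse (timestamp, line) pairs from one container into one entry per pattern."""
    out: dict[str, Deduped] = {}
    for ts, line in window:
        key = skeleton(line)
        entry = out.get(key)
        if entry is None:
            out[key] = Deduped(line, 1, ts, ts)
        else:
            entry.count += 1
            entry.last_seen = ts
    return list(out.values())


# Hypothetical log window: three copies of one failure, one unrelated line.
window = [
    (0.0, "connect to db failed after 31ms request_id=7f3c2a10-0b1e-4c8e-9a55-1d2e3f4a5b6c"),
    (0.2, "connect to db failed after 29ms request_id=0a1b2c3d-1111-2222-3333-444455556666"),
    (0.3, "cache warmed in 4s"),
    (0.5, "connect to db failed after 30ms request_id=ffffffff-aaaa-bbbb-cccc-121212121212"),
]
result = dedupe(window)
for entry in result:
    print(entry.count, entry.first_seen, entry.last_seen, entry.sample)
```

Each forwarded entry carries a count plus first and last timestamps, so the time span of the storm survives even though the individual copies do not.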

This matters during the periods where logs matter most. A typical incident shape is "one process starts failing, the failing process logs the same line several times a second, and the log pipeline that you want to be reading right now is busy carrying thousands of copies of that one line." Deduplication does not change what the log says; it changes how many bytes the pipeline carries to say it. On the storms we have observed in our own environments, the volume reduction is large enough that the line of interest becomes legible again instead of being one entry in a wall.

The thing to keep in mind: a count of 4,217 occurrences is not the same as four thousand individual log lines, even when the lines are the same. A few investigations want the exact sequence — for instance, when correlating two unrelated noisy lines that interleave. The dedup view shows the count and the first and most recent timestamps so the time span is preserved; the original raw stream remains available on the per-container view for the cases where the sequence matters.

Request profiling, on the same pipe

HTTP request profiling — the per-request breakdown of where time was spent (application, database calls, downstream HTTP, the rest) — is now exported on the same OTLP log pipeline as everything else. Before this release it was a separate path; now there is one transport, one set of TLS settings, one sampling story, one place to look when an export goes wrong. Operationally, that is one fewer thing to think about, and one fewer thing to monitor.

The other change in this section is sampling. Under high load, request profiling now samples rather than exporting every record — the same way traces have always been sampled. This is the right default: in steady state you do not need every request's profile to answer "where does this URL pattern spend its time"; you need enough to keep the percentile estimates honest. The sampling rate is adaptive, so the profile volume stays bounded even when traffic spikes. The cost is the obvious one: under very high load you will not have a profile for every request. That is the deliberate tradeoff against the alternative, which is unbounded profile volume exactly when the pipeline is most loaded.
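
The release notes do not describe how the adaptive rate is computed, so the following Python sketch is only one plausible shape for it. Everything here is an assumption: the AdaptiveSampler class, the fixed per-window budget, the rule that sets the next window's keep probability from the load just observed, and the hard cap that keeps a sudden spike within budget.

```python
import random


class AdaptiveSampler:
    """Keep at most `budget` profiles per window by adapting the keep probability."""

    def __init__(self, budget: int, rng=random.random):
        self.budget = budget
        self.rng = rng
        self.probability = 1.0  # export everything until load says otherwise
        self.seen = 0           # requests observed in the current window
        self.kept = 0           # profiles exported in the current window

    def should_export(self) -> bool:
        self.seen += 1
        if self.kept >= self.budget:  # hard cap: a spike cannot exceed the budget
            return False
        if self.rng() < self.probability:
            self.kept += 1
            return True
        return False

    def end_window(self) -> None:
        # Next window's probability is the budget divided by the load just observed.
        self.probability = min(1.0, self.budget / max(self.seen, 1))
        self.seen = 0
        self.kept = 0


sampler = AdaptiveSampler(budget=100, rng=random.Random(7).random)

quiet = sum(sampler.should_export() for _ in range(50))       # light load: keep all
sampler.end_window()
spike = sum(sampler.should_export() for _ in range(10_000))   # first spike window: capped
sampler.end_window()
steady = sum(sampler.should_export() for _ in range(10_000))  # probability has adapted
print(quiet, spike, steady, sampler.probability)
```

The cap alone would keep only the earliest requests in each window, which is why the probability is adjusted as well: once it has adapted, the exported profiles are spread across the window rather than bunched at its start.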

When a span has bad bytes

OTLP requires string fields to be valid UTF-8. Violations tend to come from applications that put binary data into a span attribute: a corrupted Redis key, a serialized blob accidentally logged, a payload from a service that does not enforce encoding. Spans with invalid UTF-8 in their attributes are now isolated and dropped individually rather than affecting the surrounding batch.

The mechanism is binary-split retry: when the collector rejects a batch, the agent splits the batch into halves and retries each, recursing until the offending span is on its own; that span is dropped, and the rest goes through. The tradeoff is the right one: only the offending span is lost, and the number of extra sends needed to isolate it grows logarithmically with batch size, which is small in practice. Each rejected span is recorded so the loss is auditable.
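
The recursion is short enough to show in full. This Python sketch makes several simplifying assumptions: a span is modeled as the raw bytes of a single attribute value, collector_accepts is a stand-in for the real collector that rejects any batch containing invalid UTF-8, and the function names are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def collector_accepts(batch: list[bytes]) -> bool:
    # Stand-in for the collector: one malformed span rejects the whole batch.
    return all(is_valid_utf8(span) for span in batch)


def export_with_split(batch, send, dropped) -> int:
    """Send `batch`; on rejection, bisect until each offending span is isolated.

    Returns the number of spans delivered and appends rejected spans to `dropped`.
    """
    if not batch:
        return 0
    if send(batch):
        return len(batch)
    if len(batch) == 1:
        dropped.append(batch[0])  # record the loss so it is auditable
        return 0
    mid = len(batch) // 2
    return (export_with_split(batch[:mid], send, dropped)
            + export_with_split(batch[mid:], send, dropped))


# Hypothetical batch with one span carrying binary data.
batch = [b"GET /a", b"redis key \xff\xfe", b"GET /b", b"GET /c"]
dropped: list[bytes] = []
delivered = export_with_split(batch, collector_accepts, dropped)
print(delivered, dropped)
```

Halves that contain no bad span are accepted on their first retry, so the extra work is concentrated on the path down to the offending span.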

Compatibility and stability

A handful of smaller items shipped in the same release. The update command now preserves the agent's run state across updates: a stopped agent stays stopped after an update completes. Memory stability of the HTTP/2 parser is improved on hosts with long-lived HTTP/2 connections. The OpenSSL uprobe path now supports the current OpenSSL 3.2.x and 3.3.x struct (ssl_st) layouts, so PostgreSQL child spans on TLS-secured connections remain captured across recent OpenSSL upgrades. And PostgreSQL detection now also covers the sendto and recvfrom syscall path — which is how libc's send and recv are actually implemented on Linux — broadening coverage across client libraries and language runtimes that route socket I/O through those calls.

What this release was about

Two threads run through this release. One is making the platform answer more of the questions that come up immediately after "this query is slow" — what else was running on the pod, what plan the database would pick, whether the problem is local to this pod or general. That thread brings data the platform already had into the same view, and adds one new piece of analysis on top of data the agent had already collected. The other thread is sharpening the agent itself — deduplicating repeated log lines during high-volume incidents and isolating malformed data without disturbing the rest of a batch. The connection between the threads is that the first one assumes the second. Slow-query analysis is only as good as the data it sits on top of.
