Release: 2026-05-15 - Bending without breaking: notes from a mid-May agent release

Blog / May 15th, 2026

An agent that runs on every node of every cluster has to adapt to its host. Kernels are older than the documentation pretends. OpenSSL ships a minor release every six months: the public API is stable across minor releases, but the internal layout of opaque connection objects can change between any two, and several have. Application servers recycle their workers on a schedule, and the agent finds out about it through events that may arrive late or not at all. The shape of this release is not "what new things does the agent see" so much as "what kinds of host conditions does the agent now handle in stride."

Five changes are worth a paragraph each: a sampling control for tracing on high-volume tiers; a new way of decrypting OpenSSL traffic that does not need to keep up with libssl's internal reorganisations; a graceful-degradation path for kernels that do not support the agent's full kernel-level instrumentation; a self-heal in PHP-FPM monitoring across worker recycles; and a cleaner startup log so the interesting messages are not buried.

Trace sampling and per-process opt-out

The agent now exposes a configurable trace sampling rate, --traces-sampling, with a default of 1.0 — every L7 span is emitted. The control matters on the largest web tiers, where a single host can emit tens of thousands of spans per second from request-level instrumentation alone, more than the downstream pipeline cares to keep. Operators can dial sampling down to a fraction without disabling tracing, and metrics and logs are not affected by the rate: only the trace channel narrows.

Some processes never want trace spans at all — batch jobs, internal chatter between sidecars, health-check probers that exist only to poke. A per-process environment-variable opt-out, set in the application's pod spec or systemd unit, removes that one process from tracing while leaving its metrics and logs intact. The opt-out is read once per process at first contact and cached for the process's lifetime, so it costs nothing in steady state.

A flow that shows L7 events flowing into a sampling gate. At rate 1.0 every event becomes a trace span. At a lower rate a fraction survive. Processes with the environment opt-out are routed past the sampling gate and emit no spans at all, while still contributing to metrics.
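The gate described above can be sketched in a few lines of Python. This is an illustration of the logic only, not the agent's code: the variable name AGENT_TRACING_OPT_OUT, the SamplingGate class, and the read_environ callback are all assumptions made for the example, since the post names only the --traces-sampling flag.

```python
import random
from typing import Callable, Dict, Mapping, Optional

# Hypothetical variable name; the post does not say what the real one is.
OPT_OUT_VAR = "AGENT_TRACING_OPT_OUT"


class SamplingGate:
    """Decides, per L7 event, whether a trace span is emitted."""

    def __init__(
        self,
        rate: float,
        read_environ: Callable[[int], Mapping[str, str]],
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sampling rate must be within [0.0, 1.0]")
        self.rate = rate
        self._read_environ = read_environ
        self._rng = rng or random.Random()
        self._opted_out: Dict[int, bool] = {}

    def _is_opted_out(self, pid: int) -> bool:
        # Read the environment once, at first contact, then serve from cache.
        if pid not in self._opted_out:
            value = self._read_environ(pid).get(OPT_OUT_VAR, "")
            self._opted_out[pid] = value.lower() in ("1", "true", "yes")
        return self._opted_out[pid]

    def should_emit_span(self, pid: int) -> bool:
        if self._is_opted_out(pid):
            return False  # only the trace channel is affected by this gate
        # random() is in [0.0, 1.0), so rate 1.0 keeps every span and 0.0 none.
        return self._rng.random() < self.rate
```

The opt-out check runs before the random draw, so an opted-out process never consumes a sample, and after the first event for a pid the check is a dictionary lookup.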

OpenSSL interception that survives version drift

To capture decrypted traffic on a TLS connection, the agent attaches to the entry points of the OpenSSL library — SSL_read and SSL_write, the userspace functions an application calls to send and receive plaintext through the TLS pipe. The catch is finding the file descriptor those entry points operate on. Their signature does not carry one — it carries a pointer to OpenSSL's own connection object, which holds the socket somewhere inside.

The previous design read that socket directly out of the OpenSSL object's memory layout. That worked, but the layout is private to OpenSSL. Maintaining the path meant carrying separate descriptors of the layout across the live 3.0, 3.2, 3.3, 3.4, 3.5, and 3.6 series, plus the still-deployed-but-EOL 1.1.1, with another descriptor due whenever a new minor lands. The friction was not in any one version; it was in the steady cost of chasing a moving target.

This release replaces that approach with one that does not look inside OpenSSL at all. The plaintext entry points still capture their buffer and length, but the file descriptor is filled in by correlating with the underlying read or write syscall that OpenSSL itself eventually performs to move the bytes onto the socket. The kernel knows which descriptor that syscall used; the agent reads it from there. The OpenSSL object's layout becomes irrelevant — every libssl version the agent attaches to, from 1.1.1 through the current 3.x series, is covered by one mechanism, and future minor releases do not require a new descriptor.

A comparison of the two approaches. On the left, the agent reaches inside the OpenSSL connection object to read the file descriptor and has separate paths for the 1.1, 3.0, and 3.2 layouts. On the right, the agent observes the plaintext entry point and then waits for the matching read or write syscall to record the file descriptor from the kernel side.
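The correlation step can be modelled in userspace to make the idea concrete. The sketch below is an assumption-laden model, not the agent's implementation: the real logic runs in kernel-level programs, and the names FdCorrelator, PlaintextEvent, and the three on_* hooks are invented for the example. It assumes the socket syscall happens on the same thread, between entry to and return from the OpenSSL call.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

ThreadKey = Tuple[int, int]  # (pid, tid)


@dataclass
class PlaintextEvent:
    pid: int
    direction: str            # "read" or "write"
    data: bytes
    fd: Optional[int] = None  # filled from the syscall, never from libssl memory


class FdCorrelator:
    """Pairs an in-flight SSL_read/SSL_write with the socket syscall under it."""

    def __init__(self) -> None:
        self._in_flight: Dict[ThreadKey, PlaintextEvent] = {}
        self.completed: List[PlaintextEvent] = []

    def on_ssl_enter(self, pid: int, tid: int, direction: str,
                     data: bytes = b"") -> None:
        # For writes the plaintext is known at entry; for reads it arrives at exit.
        self._in_flight[(pid, tid)] = PlaintextEvent(pid, direction, data)

    def on_socket_syscall(self, pid: int, tid: int, fd: int) -> None:
        # Syscalls from threads that are not inside an SSL call are ignored.
        event = self._in_flight.get((pid, tid))
        if event is not None and event.fd is None:
            event.fd = fd  # sketch choice: the first syscall seen wins

    def on_ssl_exit(self, pid: int, tid: int,
                    data: Optional[bytes] = None) -> None:
        event = self._in_flight.pop((pid, tid), None)
        if event is None:
            return
        if data is not None:
            event.data = data
        self.completed.append(event)
```

One consequence is visible in the model: a read that OpenSSL serves entirely from data it has already buffered performs no syscall, so in this sketch its fd stays None. How the agent handles that case is not described in the post.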

The honest part of this story is that the change is large for one feature: the kernel-side capture is a different mechanism from the userspace dereference it replaces, and the rollout had to stay compatible with the existing inode-level deduplication that already amortises one attachment across all processes sharing the same libssl. The reward is that a new OpenSSL minor release stops being a release-blocking event for the agent.

Graceful start on older kernels

The agent makes heavy use of kernel-level instrumentation through eBPF, which is the cheapest way to observe network and L7 traffic without slowing the host down. A few of the agent's BPF programs require features that landed in Linux kernel 5.6 — on older kernels the verifier rejects them at load time.

On those hosts, the agent now starts in a documented degraded mode instead of failing closed. It logs which kernel features were unavailable and continues with the file-system-based collectors: process discovery, systemd-service classification, host metrics, syslog ingestion, and Kubernetes event capture all stay on. L7 capture and TLS decryption are skipped, because they fundamentally need the kernel-level path, and the startup log says so plainly.

A flow diagram. The agent attempts to load its kernel-level programs. On success, every subsystem is available. On failure, the agent logs which features are unavailable, skips the kernel-dependent subsystems, and continues with file-system-based discovery, host metrics, syslog, and Kubernetes events.
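The decision in that flow is small enough to sketch. The collector names below are labels taken from the prose, and KernelLoadError, select_collectors, and the load_kernel_programs callback are hypothetical names for the example, not the agent's real interfaces.

```python
import logging
from typing import Callable, List

FILESYSTEM_COLLECTORS = [
    "process-discovery",
    "systemd-classification",
    "host-metrics",
    "syslog",
    "kubernetes-events",
]
KERNEL_COLLECTORS = ["l7-capture", "tls-decryption"]


class KernelLoadError(Exception):
    """Raised by the (hypothetical) loader when programs are rejected."""

    def __init__(self, missing_features: List[str]) -> None:
        super().__init__("kernel programs rejected: " + ", ".join(missing_features))
        self.missing_features = missing_features


def select_collectors(
    load_kernel_programs: Callable[[], None],
    log: logging.Logger,
) -> List[str]:
    """Try the kernel-level path; fall back to the degraded set on failure."""
    try:
        load_kernel_programs()
    except KernelLoadError as err:
        # Say plainly what is missing and what is being skipped, then carry on.
        log.warning(
            "degraded mode: unavailable kernel features: %s; skipping: %s",
            ", ".join(err.missing_features),
            ", ".join(KERNEL_COLLECTORS),
        )
        return list(FILESYSTEM_COLLECTORS)
    return FILESYSTEM_COLLECTORS + KERNEL_COLLECTORS
```

The sketch reacts to the load attempt itself rather than comparing kernel version strings, which matches the flow described above.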

A second change in the same spirit applies to the standalone agent binary. It is now compiled against glibc 2.31 (the Debian 11 toolchain), so it installs cleanly on Ubuntu 20.04 (glibc 2.31) and Debian 11 (2.31) — among the older enterprise baselines still in broad production — without manual workarounds.
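An operator who wants to check a host against that baseline before installing can do so with a short script. This is a sketch under assumptions: BUILD_GLIBC mirrors the 2.31 figure above, the function names are invented for the example, and the check relies on the CS_GNU_LIBC_VERSION configuration string, which only glibc systems provide.

```python
import os
import re
from typing import Optional, Tuple

BUILD_GLIBC = (2, 31)  # the version the post says the binary is compiled against


def parse_glibc(version_string: str) -> Optional[Tuple[int, int]]:
    """Turn a string such as 'glibc 2.31' into (2, 31)."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    return (int(match.group(1)), int(match.group(2))) if match else None


def host_glibc() -> Optional[Tuple[int, int]]:
    try:
        return parse_glibc(os.confstr("CS_GNU_LIBC_VERSION") or "")
    except (AttributeError, ValueError, OSError):
        return None  # not a glibc system, or the name is not supported here


def can_run(host: Optional[Tuple[int, int]]) -> bool:
    # Compare as integer tuples so that 2.4 sorts below 2.31.
    return host is not None and host >= BUILD_GLIBC


if __name__ == "__main__":
    print("host glibc:", host_glibc(), "compatible:", can_run(host_glibc()))
```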

PHP-FPM self-heal across worker recycles

PHP-FPM is structured around a pool of worker processes that handle requests concurrently. When an operator sets pm.max_requests — typically a value in the hundreds or low thousands, since the default of 0 leaves workers in place indefinitely — the pool retires a worker after it has served that many requests and starts a replacement. On a busy host, a worker is born, serves its quota, and is retired roughly every few minutes. The agent's PHP-FPM request metrics need to register every one of those new workers as it appears.
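The recycle rate follows from simple arithmetic, shown here with made-up numbers. The pool size, quota, and per-worker request rate are assumptions chosen for illustration; only pm.max_requests and its default of 0 come from PHP-FPM itself.

```python
def worker_lifetime_seconds(max_requests: int, rps_per_worker: float) -> float:
    """Seconds a worker lives before it reaches its request quota."""
    if max_requests <= 0 or rps_per_worker <= 0:
        return float("inf")  # quota disabled, or the worker is idle
    return max_requests / rps_per_worker


def pool_recycles_per_minute(workers: int, max_requests: int,
                             rps_per_worker: float) -> float:
    """Process-start events per minute the agent should expect from one pool."""
    lifetime = worker_lifetime_seconds(max_requests, rps_per_worker)
    return 0.0 if lifetime == float("inf") else workers * 60.0 / lifetime


# Assumed numbers: pm.max_requests = 500, 2 requests/s per worker, 50 workers.
print(worker_lifetime_seconds(500, 2.0))       # 250.0 seconds, about four minutes
print(pool_recycles_per_minute(50, 500, 2.0))  # 12.0 new workers per minute
```

Under these assumptions each worker lasts about four minutes, yet the pool as a whole produces a new process every five seconds, which is the rate the agent's registration has to keep up with.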

On busy hosts, workers can recycle faster than the kernel's bounded event buffer can reliably deliver process-start events, so some of those events are dropped. The agent now reconciles a worker's bookkeeping on the spot the first time it sees that worker's network traffic, querying the kernel directly for the process's accounting information and installing the missing record. A second, slower sweep runs on a timer and walks the process table for anything that has slipped through. Both paths are bounded so they cannot become a load source of their own.

A pool of PHP-FPM workers. Worker A retires after reaching its request quota. Worker C is created. The kernel emits a process-start event into a bounded buffer that occasionally drops. The agent, on receiving worker C's first network event, observes the missing record and reconciles it.
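Both repair paths can be sketched against injected callbacks. Everything named here is hypothetical: WorkerRegistry, the lookup callback standing in for the direct kernel query, list_pids standing in for the process-table walk, and the max_per_sweep limit. Only the sweep's bound is modelled; how the agent bounds the on-the-spot path is not described in the post.

```python
from typing import Callable, Dict, Iterable, Optional


class WorkerRegistry:
    """Keeps per-worker records even when process-start events are dropped."""

    def __init__(
        self,
        lookup: Callable[[int], Optional[dict]],
        list_pids: Callable[[], Iterable[int]],
        max_per_sweep: int = 256,
    ) -> None:
        self._lookup = lookup        # stand-in for querying the kernel directly
        self._list_pids = list_pids  # stand-in for walking the process table
        self._max_per_sweep = max_per_sweep
        self.records: Dict[int, dict] = {}

    def on_process_start(self, pid: int, info: dict) -> None:
        self.records[pid] = info     # the normal path, when the event arrives

    def on_process_exit(self, pid: int) -> None:
        self.records.pop(pid, None)

    def on_network_event(self, pid: int) -> bool:
        """Fast path: repair on first traffic. True if the worker is now known."""
        return pid in self.records or self._reconcile(pid)

    def _reconcile(self, pid: int) -> bool:
        info = self._lookup(pid)
        if info is None:
            return False             # the process has already exited
        self.records[pid] = info
        return True

    def sweep(self) -> int:
        """Slow path on a timer; repairs at most max_per_sweep records per run."""
        repaired = 0
        for pid in self._list_pids():
            if repaired >= self._max_per_sweep:
                break
            if pid not in self.records and self._reconcile(pid):
                repaired += 1
        return repaired
```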

The result is that PHP-FPM request rates stay accurate across long-running hosts, regardless of how often workers cycle underneath them.

Cleaner startup, continued

The startup log is one of the few places where an agent has the operator's attention. A startup log that buries the interesting line in eight pages of routine output trains people to scroll past it. Recent releases have been trimming this back: log lines from optional integrations no longer report as errors when the optional thing is simply absent, and the messages that survive are the ones a reader might act on.

This release adds one more thing to that pattern. At the end of the agent's initial discovery window — the few seconds where it walks the host and figures out what is running on it — the agent now prints a one-line summary of what it found, grouped by container type, with a handful of example identifiers per group. The dashboard already shows this information later, but having it in the agent's own log makes the discovery phase legible without scrolling, and makes it obvious whether the agent has seen what an operator expected it to see.
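As an illustration of what such a line might look like, here is a small formatter. The output format, the function name discovery_summary, and the example identifiers are all invented for the sketch; the post does not show the agent's actual log line.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def discovery_summary(found: Iterable[Tuple[str, str]],
                      examples_per_group: int = 3) -> str:
    """Build one log line from (container_type, identifier) pairs."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for container_type, identifier in found:
        groups[container_type].append(identifier)

    parts = []
    for container_type in sorted(groups):
        ids = groups[container_type]
        shown = ", ".join(ids[:examples_per_group])
        more = ", ..." if len(ids) > examples_per_group else ""
        parts.append(f"{container_type}={len(ids)} ({shown}{more})")

    if not parts:
        return "discovery complete: nothing found"
    return "discovery complete: " + "; ".join(parts)


found = [("k8s", "web-1"), ("systemd", "nginx.service"),
         ("k8s", "web-2"), ("k8s", "web-3"), ("k8s", "web-4")]
print(discovery_summary(found, examples_per_group=2))
# discovery complete: k8s=4 (web-1, web-2, ...); systemd=1 (nginx.service)
```

Printing a count next to a capped list of examples keeps the line short on large hosts while still letting an operator spot a missing group at a glance.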

Resilience as a feature

This release is about resilience. The agent now operates across a wider range of host conditions — high-volume tiers that need a sampling control, OpenSSL releases that arrive every six months, kernels older than the agent's preferred path, application servers that recycle workers faster than events arrive, startup logs that benefit from a one-line summary instead of pages of routine output. Each of these is a place where the host environment varies more than a single fixed implementation can absorb, and the agent now meets that variation directly: with a knob the operator can turn, a mechanism that does not depend on private library internals, a graceful path through reduced kernel capability, a self-reconciling worker record, and a clearer record of what was discovered at startup.

Most of what an observability agent does is invisible until the day someone needs the data. The work in this release widens the range of conditions under which that data stays accurate and the agent stays out of the operator's way — so the data is there, ready, when the day comes.
