Grogh 0.17782: The cost of watching — notes from an observability agent release

Artur Asadullin, Head of Infrastructure

The cost of watching — Grogh 0.17782: halved GC overhead, shared OTLP connection, leaner eBPF probes

An observability agent that eats the resources of the server it is supposed to watch defeats the point. The agent we work on — a single Go binary that runs on every node, reads kernel events through eBPF ring buffers, and ships metrics, logs, and traces to a collector — is bound by that constraint at every step. Every feature it gains has to be weighed against the cost it imposes on the host. The latest release was largely about that calculation: lowering the cost the agent pays for its existing visibility, and using the headroom to close a few gaps that had been quietly hurting investigations.

This post walks through the changes that mattered most. None of them are dramatic in isolation. Several of them are the kind of change that only matters because the agent runs on hundreds of nodes for years; what looks like a small allocation in a hot path becomes the difference between an agent holding 180 MB and an agent holding 400 MB after a month of uptime. Where the underlying patterns generalize — pooling buffers, sharing connections, deleting telemetry nobody reads — we have tried to say so. Most of them apply to any long-running telemetry pipeline, not just to ours.

Less garbage, faster parsing

The first set of changes was about garbage. Go's GC is good, but its cost tracks how fast heap objects are created and dropped (and how much live heap each cycle has to mark), and the agent's hot path, eBPF events arriving at tens of thousands per second, was creating a lot of them. Three techniques accounted for most of the gain.

Object pooling for short-lived buffers. Anywhere the parser needed a temporary []byte, a small struct, or a bytes.Buffer, the per-event allocation was replaced by reuse through sync.Pool. The pattern is well documented and has obvious traps — the most common is forgetting to reset state before returning the object to the pool, which leaks data from one call into the next — but applied carefully, it removes most of the per-event allocation pressure. Public benchmarks of the same technique typically report 20–40% reductions in allocation pressure under load (Go Optimization Guide); on the agent's log parser the gain landed at 56% fewer allocations and 38% lower wall-clock time, because the parser was previously allocating multiple buffers per record.
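A minimal sketch of the pooling pattern, assuming a bytes.Buffer as the temporary object. The function formatEvent and its fields are hypothetical stand-ins for the agent's parser, which is not shown in this post. The part worth copying is the reset before the buffer goes back to the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so the hot path does not allocate one per event.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatEvent is a hypothetical stand-in for per-event parsing work.
func formatEvent(pid int, comm string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	// Reset before returning the buffer to the pool, so no bytes from this
	// call leak into the next caller.
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()

	fmt.Fprintf(buf, "pid=%d comm=%s", pid, comm)
	// String copies the bytes out, so the result stays valid after reuse.
	return buf.String()
}

func main() {
	fmt.Println(formatEvent(42, "nginx"))   // pid=42 comm=nginx
	fmt.Println(formatEvent(7, "postgres")) // pid=7 comm=postgres
}
```

Returning buf.Bytes() instead of buf.String() would hand the caller a slice that the next pool user overwrites, which is the same class of bug as forgetting the reset.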

Stack allocations. Go's escape analysis decides whether a value lives on the stack or the heap based on whether anything outside the function can observe it. Returning a pointer, capturing in a closure, or storing into an interface can each force a heap allocation. Reading the escape analysis output (go build -gcflags='-m=2') for hot paths, and rewriting code so small structs stay on the stack, is unglamorous but cumulative. A lot of "unnecessary" heap usage in Go turns out to be one interface conversion away from staying on the stack.
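To make the pointer-versus-value distinction concrete, here is a small sketch. The point type and both constructors are illustrative, not agent code, and the noinline directives are there only so the compiler cannot optimize the difference away in a toy program. Building it with -gcflags='-m=2' should show the composite literal in newPointHeap reported as escaping.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// newPointHeap returns a pointer to a local value. The value must outlive
// the call, so the compiler places it on the heap.
//
//go:noinline
func newPointHeap(x, y int) *point { return &point{x, y} }

// newPointStack returns the struct by value, so it can live in the
// caller's stack frame and never reaches the garbage collector.
//
//go:noinline
func newPointStack(x, y int) point { return point{x, y} }

func main() {
	sink := 0
	// testing.AllocsPerRun reports the average number of heap allocations per call.
	viaPointer := testing.AllocsPerRun(1000, func() { sink += newPointHeap(1, 2).x })
	viaValue := testing.AllocsPerRun(1000, func() { sink += newPointStack(1, 2).y })
	fmt.Printf("allocs per call: pointer=%v value=%v (sink=%d)\n", viaPointer, viaValue, sink)
}
```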

Zero-allocation parsing paths. The parsers that walk eBPF event payloads — HTTP, Postgres, Redis, MySQL, MongoDB — used to allocate intermediate slices for headers, methods, query templates. Most of those slices can be windows into the original payload buffer, not copies of it, as long as the caller is careful about lifetimes. Where the caller has to retain the result past the lifetime of the buffer, an explicit ownership transfer makes the copy explicit instead of accidental.
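A sketch of the borrow-versus-own split, using an HTTP request line as the payload. The helper names httpMethod and own are hypothetical. The first returns a window into the caller's buffer, the second is the explicit ownership transfer for results that must outlive it.

```go
package main

import (
	"bytes"
	"fmt"
)

// httpMethod returns the request method as a window into payload.
// Nothing is copied, so the result is only valid while payload is.
func httpMethod(payload []byte) ([]byte, bool) {
	i := bytes.IndexByte(payload, ' ')
	if i <= 0 {
		return nil, false
	}
	// The three-index slice clips capacity, so an append by the caller
	// cannot write into the rest of payload.
	return payload[:i:i], true
}

// own makes the ownership transfer explicit: the caller gets a private
// copy that survives reuse of the original buffer.
func own(b []byte) []byte {
	return append([]byte(nil), b...)
}

func main() {
	payload := []byte("GET /healthz HTTP/1.1\r\n")
	method, ok := httpMethod(payload)
	if !ok {
		return
	}
	kept := own(method)
	payload[0] = 'X' // simulate the buffer being reused for the next event
	fmt.Printf("borrowed=%s owned=%s\n", method, kept) // borrowed=XET owned=GET
}
```

The borrowed slice changing under the caller is exactly the lifetime bug this kind of parsing invites, which is why the copy is worth naming rather than leaving implicit.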

The cumulative effect was that GC overhead on the hot path roughly halved, settling at around 5% of CPU time on production hosts (measured via runtime.MemStats.GCCPUFraction aggregated over a couple of weeks). That is not a microbenchmark number; the agent runs the same workload, allocates roughly half as much, and gets out of the host's way more often. The other observation worth keeping is the order things were done in: pooling first, then escape analysis, then zero-alloc parsing. Pooling has the highest gain per hour spent. Escape analysis takes longer to read than to fix. Zero-alloc parsing is where the lifetime bugs hide, and is best done last, when the rest of the path is already stable.
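For readers who want to track the same number, a minimal sketch of reading it follows. The allocation loop exists only to give the collector something to do; how the agent exports and aggregates the value is not described here and is left out.

```go
package main

import (
	"fmt"
	"runtime"
)

// gcCPUFraction returns the fraction of the program's available CPU time
// spent in the garbage collector since the process started.
func gcCPUFraction() float64 {
	var ms runtime.MemStats
	// ReadMemStats briefly stops the world, so sample it on a slow timer
	// rather than on the hot path.
	runtime.ReadMemStats(&ms)
	return ms.GCCPUFraction
}

func main() {
	// Generate some short-lived garbage so at least one GC cycle runs.
	var sink [][]byte
	for i := 0; i < 100000; i++ {
		sink = append(sink[:0], make([]byte, 1024))
	}
	runtime.GC()
	fmt.Printf("GC CPU fraction since start: %.4f\n", gcCPUFraction())
}
```

Because the value is cumulative since process start, a long-running agent needs it sampled over time to see a change after a deploy.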

Fewer, cheaper network pipes

The next set of changes was about the network. The agent ships three signals — metrics, logs, and traces — to a downstream collector over OTLP. Until this release, each signal had its own gRPC connection: three TLS handshakes, three idle keepalives, three sets of connection bookkeeping per agent.

OTLP defines a separate RPC service per signal, but nothing about the protocol requires a separate channel — multiplexing distinct services over a single HTTP/2 connection is a property of the gRPC transport (OTLP spec). Most language SDKs default to one channel per signal for compatibility; the OpenTelemetry Go SDK happens to expose the underlying grpc.ClientConn on each exporter and lets you pass an existing connection in. That is what the agent now does — one connection, three exporters bound to it.

Three separate gRPC channels become one. The signal streams stay distinct at the RPC layer; HTTP/2 multiplexes them inside the same connection.

The other change in this section was that gzip compression is now on by default across all OTLP exporters. The OTLP wire format is compact already (protobuf with a small attribute table), but most telemetry payloads carry a lot of repeated tokens, like service names, attribute keys, and label values. gzip typically cuts bandwidth by 70–90% on that kind of payload at modest CPU cost (OTLP exporter spec on supported compressions). The compression is negotiated through gRPC's built-in mechanism, so no custom plumbing is needed; turning it on is a configuration change, not a code change.
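A quick way to get a feel for why repeated tokens compress well is to gzip a synthetic payload with the standard library. This sketch uses JSON-like text with made-up attribute names rather than real OTLP protobuf, so the ratio it prints is illustrative only and will differ from what a real exporter sees.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipSize returns the size of payload after gzip compression.
func gzipSize(payload []byte) (int, error) {
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	if _, err := zw.Write(payload); err != nil {
		return 0, err
	}
	// Close flushes the remaining compressed bytes and the gzip trailer.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return out.Len(), nil
}

func main() {
	// Synthetic records with the repetition typical of telemetry:
	// the same keys and service names on every record.
	var payload bytes.Buffer
	for i := 0; i < 500; i++ {
		fmt.Fprintf(&payload, `{"service.name":"checkout","http.method":"GET","http.route":"/cart","status":%d}`, 200+i%3)
	}
	n, err := gzipSize(payload.Bytes())
	if err != nil {
		fmt.Println("compress:", err)
		return
	}
	fmt.Printf("raw=%d bytes, gzip=%d bytes\n", payload.Len(), n)
}
```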

The practical lesson, if you operate an OTLP pipeline of your own: check whether your exporters share a connection and whether they compress. Many SDK defaults still favor the historical one-channel-per-signal layout for compatibility, and gzip is off by default in some language bindings. Both settings are typically a few lines of code to flip.

eBPF, slimmer

The eBPF subsystem is the agent's most expensive surface, because it runs in kernel context and copies events into userspace at the kernel's pace. The cheapest event, by a wide margin, is the one that never crosses that boundary. Three changes dropped a lot of crossings.

Probe-level accounting, applied as an audit. The agent now records per-probe accounting — events emitted, events parsed, downstream consumers attached — so every probe carries a visible cost-vs-value signal. The first time we ran that audit on a release candidate it retired two probes whose downstream consumers had been removed in earlier work: a scheduler-latency tracer and a TCP congestion-window snapshot. Without per-probe metrics, dead probes are invisible from the dashboard side — nothing about the absence of a downstream consumer changes how the probe itself looks — which is exactly the gap the audit closes. It now runs on every release, not just this one.
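The audit can be sketched with three counters per probe. The type names, the probe names, and the rule used here (emitting events while no consumer is attached) are assumptions for illustration; the agent's actual accounting is not shown in this post.

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

// probeStats is a hypothetical per-probe accounting record.
type probeStats struct {
	emitted   atomic.Uint64 // events the probe pushed to userspace
	parsed    atomic.Uint64 // events a parser actually decoded
	consumers atomic.Int32  // downstream consumers currently attached
}

// registry maps probe name to its counters.
type registry map[string]*probeStats

// deadProbes returns probes that still emit events but have no
// downstream consumer, sorted by name for stable output.
func (r registry) deadProbes() []string {
	var dead []string
	for name, s := range r {
		if s.emitted.Load() > 0 && s.consumers.Load() == 0 {
			dead = append(dead, name)
		}
	}
	sort.Strings(dead)
	return dead
}

func main() {
	r := registry{
		"ssl_read":      &probeStats{},
		"sched_latency": &probeStats{},
	}
	r["ssl_read"].emitted.Add(1200)
	r["ssl_read"].parsed.Add(1200)
	r["ssl_read"].consumers.Store(1)
	r["sched_latency"].emitted.Add(90000) // still paying the cost, nobody reads it

	fmt.Println("probes with no consumer:", r.deadProbes()) // probes with no consumer: [sched_latency]
}
```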

OpenSSL uprobes switched from per-PID to inode-level. Userspace probes (uprobes) intercept calls to library functions; for libssl the relevant entry points are SSL_read and SSL_write, which the agent uses to capture decrypted L7 payloads from TLS traffic. Per-PID attachment — one attachment per process that loads libssl — was the historically portable choice, because robust per-inode attachment with cookie-based PID filtering matured over later kernel versions (Pixie's eBPF SSL tracing writeup walks through the model). With our current minimum kernel floor we can attach once at the library inode and let the kernel route events to the agent's filter, regardless of how many processes have libssl mapped.

Attaching at the library inode replaces N per-process attachments with one. The events still arrive tagged with PID; the kernel does the routing.

Per-call uprobe overhead lands in the low-microsecond range on current kernels — small per call, but it scales with process count under the per-PID model and pays an additional attach cost on every exec(). The inode-level model amortizes that to a single attachment per shared library, performed once at startup, and the per-call cost no longer scales with process count.

Health-check traffic is now class-aware. Health checks from nginx to localhost typically fire on the order of once a second per worker, return immediately, and look identical to one another — high in volume, low in per-event information. The agent now recognizes the connection shape (loopback, short-lived, conventional health-check URLs) and keeps aggregated counters and latency distributions in full while retaining one representative full trace per N requests. Detecting that health checks have stopped or that their latency profile has shifted is preserved; the per-request trace volume is not. This is the kind of refinement that's easy to defer and hard to argue down at the source, since teams have operational reasons for their health-check rates.
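A sketch of the keep-aggregates, sample-traces idea. The path list, the sampling interval, and the type names are assumptions; the "short-lived" part of the connection shape is not modeled, and a real implementation would keep a latency histogram rather than the count, total, and max used here. The type is also not safe for concurrent use as written.

```go
package main

import (
	"fmt"
	"net/netip"
	"time"
)

// healthPaths is an assumed list of conventional health-check URLs.
var healthPaths = map[string]bool{"/health": true, "/healthz": true, "/ping": true}

// isHealthCheck classifies a request by peer address and path.
func isHealthCheck(peerIP, path string) bool {
	addr, err := netip.ParseAddr(peerIP)
	return err == nil && addr.IsLoopback() && healthPaths[path]
}

// healthClass keeps aggregates for every request in the class and
// marks one request in every `every` for full trace retention.
type healthClass struct {
	every uint64
	count uint64
	total time.Duration
	max   time.Duration
}

// observe updates the aggregates and reports whether this request's
// full trace should be kept.
func (h *healthClass) observe(latency time.Duration) bool {
	if latency > h.max {
		h.max = latency
	}
	h.total += latency
	h.count++
	// Keep the first request, then every Nth one after it.
	return h.every > 0 && (h.count-1)%h.every == 0
}

func main() {
	h := &healthClass{every: 100}
	kept := 0
	for i := 0; i < 1000; i++ {
		if isHealthCheck("127.0.0.1", "/healthz") && h.observe(2*time.Millisecond) {
			kept++
		}
	}
	fmt.Printf("requests=%d traces kept=%d mean=%v max=%v\n",
		h.count, kept, h.total/time.Duration(h.count), h.max)
}
```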

Tightening cache bounds

A sweep of the agent's long-lived caches turned up a Go-specific gotcha worth describing before the fixes themselves. A Go map that grows to N buckets during a traffic spike and then loses most of its entries does not give back the bucket array — Go does not shrink a map's bucket count after deletions. Steady-state memory pins to the worst minute the cache has ever seen. The fix is to periodically rebuild the map: copy the live entries into a new map, drop the old one, let the GC reclaim the buckets. The rebuild itself isn't free, but on caches where the high-water mark dwarfs steady state, the saved memory is significant.
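The rebuild itself is only a few lines. This is a generic sketch, not the agent's code; when to trigger it (a timer, or a ratio between the high-water mark and the current length) is a policy choice left out here.

```go
package main

import "fmt"

// rebuild copies the live entries into a fresh map sized for them.
// Once the caller drops the old map, the GC can reclaim its buckets.
func rebuild[K comparable, V any](old map[K]V) map[K]V {
	fresh := make(map[K]V, len(old))
	for k, v := range old {
		fresh[k] = v
	}
	return fresh
}

func main() {
	cache := make(map[uint64][]byte)
	// Simulate a traffic spike followed by mass eviction.
	for i := uint64(0); i < 1_000_000; i++ {
		cache[i] = nil
	}
	for i := uint64(1000); i < 1_000_000; i++ {
		delete(cache, i)
	}
	// The original map still holds storage sized for the spike.
	// Replacing it lets that storage go.
	cache = rebuild(cache)
	fmt.Println(len(cache)) // 1000
}
```

If the map is shared across goroutines, the swap needs the same lock that guards reads and writes.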

From there, the work was uncontroversial. Every cache that lives for the process lifetime now has both an LRU policy and an absolute entry cap, and caches whose footprint is dominated by transient high-water marks are periodically rebuilt as described above. The same rule applied to a few sample collections (e.g. a per-fingerprint span attribute sampler) where the per-bucket size was bounded but the bucket count wasn't. On a representative production node, steady-state RSS now settles around 180 MB instead of trending toward the previous 400 MB high-water mark — most of the recovered footprint is Go map buckets that earlier traffic spikes had pinned, with the rest from caches whose bucket count had drifted past their working set.

The general rule we follow now: any cache that lives for the process lifetime needs both an eviction policy and an absolute size ceiling, and if it's a Go map whose size varies a lot under load, periodic rebuild is worth doing too.
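For reference, a minimal LRU with a hard entry cap built on container/list. It is a sketch with string keys and int values and no locking, not the agent's cache; the names lru, newLRU, get, and put are illustrative.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value int
}

// lru is a fixed-capacity cache: least recently used entries are
// evicted once the entry count exceeds limit.
type lru struct {
	limit int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newLRU(limit int) *lru {
	return &lru{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

func (c *lru) get(key string) (int, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

func (c *lru) put(key string, value int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	// Enforce the absolute ceiling regardless of access pattern.
	for c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRU(2)
	c.put("a", 1)
	c.put("b", 2)
	c.get("a")    // "a" is now the most recently used
	c.put("c", 3) // evicts "b"
	_, ok := c.get("b")
	fmt.Println(ok, len(c.items)) // false 2
}
```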

What the agent can now see

The same release also extended what the agent can see. None of these additions are big enough for their own announcement; each one closes a real gap that was easy to miss until an investigation hit it.

Unix sockets

AF_UNIX traffic now appears in the service map. The eBPF layer previously only attached to IP-based connection events, so connections over Unix-domain sockets — a common shape for PHP-FPM behind nginx, sidecar proxies, and control-socket daemons — registered as endpoints rather than as edges between them. The fix attaches the relevant unix_stream_connect and close-path hooks and surfaces the socket path as the link name, so a service connected through /var/run/foo.sock shows the socket path on the connecting edge.

One heads-up for existing deployments: customers running sidecar-heavy or PHP-FPM workloads will see new edges appear in their service map after upgrade. The graph is more complete, not different — but if you've been using the previous map shape as an implicit baseline, the comparison will look noisier for a release.

PostgreSQL schema collection

The agent now periodically pulls table and column metadata from PostgreSQL instances it discovers on the host and pushes it to the platform as a structured OTLP log record — logs are the most flexible OTLP signal for structured non-event payloads, and reusing the existing exporter avoided inventing a fourth pipeline for occasional metadata. Discovery is automatic: the agent watches /proc/*/net/tcp for processes that look like Postgres listeners and probes them with the protocol's startup handshake, regardless of port. Multiple databases on a host are handled. The schema is refreshed once an hour by default.
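The first half of that discovery step, finding listening sockets, comes down to parsing rows of the /proc net/tcp table. The sketch below handles a single row and assumes the standard IPv4 layout (local address as hex IP and port, state in the fourth column, 0A meaning LISTEN). Mapping the socket back to a process and the Postgres startup-handshake probe are not shown, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tcpListen is the TCP_LISTEN state as printed in /proc/net/tcp.
const tcpListen = "0A"

// listenPort returns the local port of one /proc/net/tcp row if the
// socket is in the LISTEN state. The header row and non-listening
// sockets yield ok == false.
func listenPort(row string) (port uint16, ok bool) {
	f := strings.Fields(row)
	if len(f) < 4 || f[3] != tcpListen {
		return 0, false
	}
	// local_address looks like "0100007F:1538": hex IP, colon, hex port.
	i := strings.LastIndexByte(f[1], ':')
	if i < 0 {
		return 0, false
	}
	p, err := strconv.ParseUint(f[1][i+1:], 16, 16)
	if err != nil {
		return 0, false
	}
	return uint16(p), true
}

func main() {
	row := "   0: 0100007F:1538 00000000:0000 0A 00000000:00000000 00:00000000 00000000   113        0 23456 1 0000000000000000 100 0 0 10 0"
	if port, ok := listenPort(row); ok {
		fmt.Println("listener on port", port) // listener on port 5432
	}
}
```

A listening port alone says nothing about what is behind it, which is why the handshake probe matters when Postgres runs on a non-default port.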

What this is for: when you are looking at a slow query in the platform, the schema tells you which indexes the query could have used, what column types it is operating on, and which tables it joins. Without this, the only way to get that information is to open psql against the production database — a workflow with friction. Having the schema in the platform lets the slow-query tooling explain itself.

Redis over TLS

Redis-over-TLS now decodes through the same RESP parser as plaintext Redis. The OpenSSL uprobe path that already handles TLS-wrapped HTTP/1.1 and HTTP/2 now classifies Redis on the encrypted side as well, so commands and keys are visible in traces of TLS-secured Redis connections. For teams running encrypted Redis, the trace no longer goes silent at the TLS handshake.
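Once the uprobe path hands over decrypted bytes, a Redis command is a RESP array of bulk strings. The following is a simplified sketch of decoding one complete command from a buffer. It is not the agent's parser; it handles only the client command shape and returns arguments as windows into the input rather than copies. The argument cap is an arbitrary assumption.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

var (
	errMalformed  = errors.New("resp: malformed input")
	errIncomplete = errors.New("resp: incomplete input")
)

const maxArgs = 1024 // arbitrary sanity cap for this sketch

// readInt parses "<prefix><decimal>\r\n" and returns the number and the
// bytes that follow it.
func readInt(b []byte, prefix byte) (int, []byte, error) {
	if len(b) == 0 {
		return 0, nil, errIncomplete
	}
	if b[0] != prefix {
		return 0, nil, errMalformed
	}
	end := bytes.Index(b, []byte("\r\n"))
	if end < 0 {
		return 0, nil, errIncomplete
	}
	n, err := strconv.Atoi(string(b[1:end]))
	if err != nil {
		return 0, nil, errMalformed
	}
	return n, b[end+2:], nil
}

// parseCommand decodes one RESP array of bulk strings, e.g.
// "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n". Arguments alias the input buffer.
func parseCommand(b []byte) ([][]byte, error) {
	n, rest, err := readInt(b, '*')
	if err != nil {
		return nil, err
	}
	if n < 0 || n > maxArgs {
		return nil, errMalformed
	}
	args := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		var size int
		size, rest, err = readInt(rest, '$')
		if err != nil {
			return nil, err
		}
		if size < 0 {
			return nil, errMalformed
		}
		if len(rest) < size+2 {
			return nil, errIncomplete
		}
		if rest[size] != '\r' || rest[size+1] != '\n' {
			return nil, errMalformed
		}
		args = append(args, rest[:size])
		rest = rest[size+2:]
	}
	return args, nil
}

func main() {
	args, err := parseCommand([]byte("*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"))
	fmt.Printf("%q %v\n", args, err) // ["GET" "foo"] <nil>
}
```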

Container names on the same node

The last item is small and mostly cosmetic, but worth mentioning because the cosmetic version is the one humans actually use. Traffic between containers on the same Kubernetes node used to show raw pod IPs on both sides of the connection, which made the service map readable only by someone with kubectl open. The agent now resolves both ends of a same-node connection to container metadata at trace time and displays the container name. The graph is the same shape; it is just legible.

What this kind of work tends to look like

Reading the list back, the changes fall into three buckets that recur in agent work in general.

The first bucket is cost the agent was paying for nothing — connections that were duplicated where they could be shared, parsers that allocated where they could borrow, compression that was off by default on a payload it always pays to compress. These tend to compound silently, because nothing on the dashboard changes when they get worse, and the agent's own self-metrics are usually the only place they show up. Auditing them periodically against the things downstream actually reads is the only way to catch them; they will not raise their hand.

The second bucket is cost the agent was paying for something, but more than necessary — full traces on traffic that could be sampled, per-process hooks where a kernel update lets the work be shared, cache footprints that have outgrown their working set. These are the cases where the design was once right and then the operating environment moved underneath it. Sampling and bounding are usually retrofittable; the harder part is noticing the budget is gone.

The third bucket is visibility gaps the agent did not know it had — the Unix sockets, the schema metadata, the Redis-over-TLS branch. None of these surface in a feature backlog because nobody asks for them in advance. They surface in the postmortem, when somebody is trying to figure out what happened and discovers that the data they would need is the one shape the agent does not record. The way they tend to enter the backlog at all is somebody saying, "we should have known about this last Tuesday," and somebody else writing it down.

That last shape is, broadly, how we tend to think about agent work in general. The agent's job is to make sure the data is not the limiting factor when someone is trying to figure out what happened. Most of the time, that work is invisible — until the postmortem, when the question turns out to be "what was the schema of the slow query," or "which container on this node was talking to that one," or "why did the trace stop at the TLS handshake." Visibility gaps almost always announce themselves in the writeup, well after the incident itself. The release notes are quieter than the incidents they prevent.
