Engineering Notes

The failure that felt normal — A daily AI quota that drained for a year, normal only because it always had

The failure that felt normal

An investigation story: a client's AI image feature failed for most of every day, on the same daily schedule, since the day it launched — a Gemini per-day quota that drained early each day. Because it had always behaved that way, the team called it normal and quietly lost the users who hit it. How an ApexData audit found the pattern, and why a defect present from the first deploy is the hardest kind to see.

What counts as evidence — grading the output of a tool-using agent

What counts as evidence: grading the output of a tool-using agent

A pattern for tool-using AI agents — make findings the unit of output, gate the top-tier severity claim on objective evidence at record time, and require cross-source corroboration for any elevated claim at synthesis time. The agent's report becomes something a reviewer can rank.
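The two gates described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the severity names, the `record` and `synthesize` helpers, the downgrade targets, and the two-source threshold are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical severity ladder; "critical" is the top tier, "high" and above count as elevated.
RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Finding:
    claim: str
    severity: str
    sources: set[str] = field(default_factory=set)
    evidence: list[str] = field(default_factory=list)


def record(claim: str, severity: str, source: str, evidence: tuple[str, ...] = ()) -> Finding:
    # Record-time gate: a top-tier claim without objective evidence is downgraded one step.
    if severity == "critical" and not evidence:
        severity = "high"
    return Finding(claim, severity, {source}, list(evidence))


def synthesize(findings: list[Finding]) -> list[Finding]:
    # Merge findings that make the same claim, pooling their sources and evidence.
    merged: dict[str, Finding] = {}
    for f in findings:
        m = merged.get(f.claim)
        if m is None:
            merged[f.claim] = Finding(f.claim, f.severity, set(f.sources), list(f.evidence))
            continue
        m.sources |= f.sources
        m.evidence += f.evidence
        if RANK[f.severity] > RANK[m.severity]:
            m.severity = f.severity
    # Synthesis-time gate: an elevated claim needs at least two distinct sources (assumed threshold).
    for m in merged.values():
        if RANK[m.severity] >= RANK["high"] and len(m.sources) < 2:
            m.severity = "medium"
    # Highest severity first, so a reviewer can read the report as a ranking.
    return sorted(merged.values(), key=lambda m: RANK[m.severity], reverse=True)


findings = [
    record("disk full on db-1", "critical", "metrics", ("df reports 100% used",)),
    record("disk full on db-1", "high", "logs"),
    record("memory leak in api", "critical", "logs"),  # no evidence, single source
]
report = synthesize(findings)
```

In this sketch the corroborated disk finding keeps its top-tier rating, while the uncorroborated memory-leak claim ends up at `medium`.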

EXPLAIN, offline — Reconstructing a query plan from collected statistics, with no connection to the database

What Postgres knows about your tables

How to predict what PostgreSQL would do with a query without running it — the statistics the planner reads, where they live, and how an offline analyzer reconstructs the plan from collected pg_class, pg_indexes, and pg_stats data, plus the honest boundary between structural prediction and the live cost model.

Bending without breaking — May 15 release: OpenSSL durability, kernel fallback, PHP-FPM self-heal

Release: 2026-05-15 - Bending without breaking: notes from a mid-May agent release

Notes from a mid-May 2026 release of our observability agent. A configurable trace sampling rate and per-process opt-out for high-volume tiers, an OpenSSL interception path that no longer tracks libssl internal layouts, a graceful-degradation mode for older kernels, PHP-FPM monitoring that self-heals across worker recycles, and a cleaner startup log.

Closer to the question — April 3 release: logs, dashboards, and kernel-level signals

Release: 2026-04-03 - Closer to the question: notes from the April 3 release

Notes from an April 2026 release: the logs page redrawn with a collapsible filter sidebar and resizable columns, Lighthouse split onto its own tab in URL monitoring, an AI dashboard agent that reads its own rendered output and revises itself, and an observability agent that now records packet drops, DNS query shape, TCP round-trip time, and syslog alongside the application-protocol view.

Around a slow query — March 17 release: concurrent requests, offline EXPLAIN, deduplicated logs

Release: 2026-03-17 - Around a slow query: notes from the March 17 release

Notes from a March 2026 release: more context around a slow query in the platform — concurrent HTTP requests matched by interval overlap, an offline SQL Explain analyzer, and a cross-pod comparison toggle — plus log deduplication, request profiling on the OTLP pipeline, and resilient handling of malformed spans in the agent.
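Matching concurrent requests by interval overlap can be sketched as below. This is an illustrative example only, not the platform's code: the `Span` type, its field names, and the `concurrent_requests` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start: float  # seconds since some shared reference point
    end: float


def overlaps(a: Span, b: Span) -> bool:
    # Two closed intervals overlap when each one starts before the other ends.
    return a.start <= b.end and b.start <= a.end


def concurrent_requests(query: Span, requests: list[Span]) -> list[Span]:
    # Keep every request whose time interval intersects the slow query's interval.
    return [r for r in requests if overlaps(query, r)]


slow_query = Span("slow query", 10.0, 14.5)
requests = [
    Span("GET /a", 8.0, 9.9),    # finished before the query started
    Span("GET /b", 9.0, 10.2),   # still running when the query started
    Span("POST /c", 12.0, 13.0), # entirely inside the query window
    Span("GET /d", 14.6, 15.0),  # started after the query ended
]
matched = [r.name for r in concurrent_requests(slow_query, requests)]
```

Only `GET /b` and `POST /c` are matched here, since they are the two requests whose intervals intersect the query's window.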

Two ways to draw a dashboard — March 3 release: generated dashboards, model fallback, hybrid log search

Release: 2026-03-03 - Two ways to draw a dashboard: notes from a March release

Notes from a March 2026 release: AI-generated dashboards with a visualization linter and cross-provider model fallback, three hand-built dashboards for APM, pod restarts and deployments, and a hybrid full-text log search powered by Manticore.

Three more places to look — Feb 9 release: tree-search investigations, ingress dashboards, URL monitoring

Release: 2026-02-09 - Three more places to look: notes from a February release

Notes from a February 2026 release: a tree-search investigation mode, dedicated ingress-controller dashboards with throttling and scheduling delay, and synthetic URL monitoring from outside the cluster.

The cost of watching — Grogh 0.17782: halved GC overhead, shared OTLP connection, leaner eBPF probes

Grogh 0.17782: The cost of watching — notes from an observability agent release

Notes from a recent release of our observability agent. Lower GC, network, eBPF, and memory cost on the host, plus visibility into Unix sockets, PostgreSQL schemas, Redis over TLS, and same-node container connections.

AI-First Coding: Closing the Gap Between Skeptics and Practitioners in Dev Teams

Notes from a Tel Aviv meetup talk on AI-first coding in 2026: why developer skepticism is mostly outdated, where the real concerns sit (trust, security, fun), and how to share adoption inside a team without mandating it.

Surviving a wrong first guess — design notes: tree of hypotheses, specialist subagents, scored evidence

Tree-search agents: building an AI agent that survives a wrong first guess

Design notes from building an investigation agent for production incidents: tree of hypotheses, specialised subagents, evidence-scored evaluation, bounded search, and configurable models.

Claude Code Workshop & Best Practices

Speaker: Evgeny Potapov, ApexData co-founder & CEO

Building observability strategy — Part 3: runbooks, reversible deploys, recovery time

Building an effective observability strategy - Part 3

Once the layers from Part 1 are in place and the code from Part 2 has been written, the on-call experience changes. Part 3 of a series on observability strategy: the practices that reduce the rate of escalations from on-call rotations to the developers who built the system.

Building observability strategy — Part 2: 30% of the codebase, 28% more development time

Building an effective observability strategy - Part 2

Observability is not a tool you buy; it is code you write, and a meaningful fraction of the code in a working production system. Part 2 of a series on observability strategy: how much code is instrumentation, what that means for how a team works, and what it costs in development time.

Building observability strategy — Part 1: an observability checklist covering user experience, tracing, and infrastructure

Building an effective observability strategy - Part 1

An observability strategy designed from the people the system serves, not the boxes it runs on. A top-down tour of the layers — user experience, business signals, tracing, service monitoring, infrastructure, user feedback — and what each one answers.
