The failure that felt normal

Blog / June 4th, 2026

The most expensive failures are not the ones that page you at three in the morning. They are the ones nobody pages about at all, because the system has behaved this way for as long as anyone can remember, and a thing that has always happened stops looking like a thing that is happening. It just becomes the weather.

One of our clients ran an AI image feature that failed on a schedule. Every day, for most of the day, for the better part of a year. Users would ask the product to generate an illustration, get a generic error, and many of them would never try again. The team knew the feature was “a bit flaky.” What they did not know, until they moved onto ApexData and an audit laid it out hour by hour, was that the flakiness had a shape, the shape had a cause, and the cause had been shipped on the same day as the feature. This is the story of how that surfaced, and why a defect present from the first deploy is the hardest kind to see.

A feature that had always behaved this way

The client is a learning platform. Among other things, their product lets a user generate images on demand — an illustration for a lesson, cover art for a course, a picture to pin to a flashcard. Behind a button in the app, a server route calls a hosted image model — Google’s Gemini — and returns the result. When it works, it is the kind of small delight that makes people stay. When it does not, the user sees a short, generic error and moves on.

It did not work for a predictable stretch of every day. Support saw a trickle of tickets — “image generation is broken,” “it keeps erroring” — and the standing answer was that AI features are like that: third-party, probabilistic, occasionally down. The team had no “before” in which the feature had behaved differently, so there was no regression to point at and no moment when something got worse. The errors were part of the baseline. They had been part of the baseline since launch. Nobody had wired an alert to them, because you do not alert on the weather.

The client was not flying blind, either. They had been running an observability system the whole time, and the errors were in it — sitting in the logs every day, as they had since launch. What the system did not have was a chart or an alert pointed at them, which turns out to be the entire difference. We will come back to that.

The shape on the audit

The pattern fell out when the client moved onto ApexData and ran an audit with its investigation agent, the surface that runs the telemetry queries an engineer would otherwise run by hand and gathers the answers in one place. ApexData sees a client’s telemetry, not their source code, so what the audit surfaced was the shape of the failure, not the line behind it; the code changes at the end happened in the client’s own repository. Over a representative two-day window, the image route had logged a few hundred failed image requests. That alone is unremarkable; plenty of routes carry a low error rate forever. What was remarkable was when the errors happened. Plotted by hour and lined up against the model’s daily quota reset, they did not scatter. Each day told the same story: a short quiet spell right after the reset, then the failures switched on and stayed on until the next reset cleared the counter.

Figure: the audit’s failed requests by hour, aligned to the model’s daily quota reset (midnight Pacific time, around 07:00–08:00 UTC depending on daylight saving). For the first couple of hours after the reset a few requests still get through; the short grey bars are the rarer second failure mode, a request that reached the model and came back with no image. Then the day’s small allowance runs out, the red 429s switch on, and they stay on until the next reset. The shape repeats, locked to the reset clock, which is rarely how a bug in your own code behaves.

That alignment is the tell. Random faults do not keep a schedule. A failure that switches on at the same point relative to a fixed daily boundary, and holds until that boundary comes round again, is following something with a daily period — a cron job, a batch window, a billing boundary, or a quota that resets on a clock. The reset alignment pointed at the last one, and the error text confirmed it.
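The alignment check itself is a small query. Here is a minimal sketch, assuming the failures are available as (UTC timestamp, status code) pairs; the function names and the sample events are illustrative, not the audit's actual query. The reset hour is passed in because midnight Pacific lands at 07:00 or 08:00 UTC depending on daylight saving.

```python
from collections import Counter
from datetime import datetime, timezone


def hours_since_reset(ts_utc: datetime, reset_hour_utc: int) -> int:
    """Whole hours elapsed since the most recent daily reset.

    Assumes ts_utc is timezone-aware. reset_hour_utc is 7 or 8 for a
    midnight-Pacific reset, depending on daylight saving.
    """
    return (ts_utc.astimezone(timezone.utc).hour - reset_hour_utc) % 24


def rate_limited_by_reset_hour(events, reset_hour_utc: int) -> Counter:
    """Count 429 responses per hour-since-reset bucket."""
    counts = Counter()
    for ts_utc, status in events:
        if status == 429:
            counts[hours_since_reset(ts_utc, reset_hour_utc)] += 1
    return counts


# Hypothetical sample: one success just after the reset, two refusals later.
events = [
    (datetime(2026, 1, 10, 8, 30, tzinfo=timezone.utc), 200),
    (datetime(2026, 1, 10, 11, 5, tzinfo=timezone.utc), 429),
    (datetime(2026, 1, 11, 7, 59, tzinfo=timezone.utc), 429),
]
by_hour = rate_limited_by_reset_hour(events, reset_hour_utc=8)
```

If the counts are empty in the first bucket or two and full in every bucket after that, day after day, the failure is following the reset clock rather than anything random.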

Zero requests left for the day

The red bars were all the same upstream failure: the image model rejecting the request with a rate-limit status and a message that the daily quota had been used up. Trimmed to the part that matters, it read:

429 Too Many Requests
You exceeded your current quota — daily limit for this
image model reached. Quota resets at the start of the day.

The quota in question is a per-day cap on how many image requests the project may make against that specific model. It is not a per-second throttle you ride and recover from within the same minute; it is a fixed allowance for the day, refilled once every twenty-four hours when the quota resets at midnight Pacific time — around 07:00–08:00 UTC, depending on daylight saving. On the plan the client was on, that daily allowance was small relative to their traffic. So each day played out the same way: the reset refilled the counter, the first couple of hours of real usage drained it, and from the moment it hit zero until the next day’s reset — the better part of a full day — every image request was refused. The next reset started the cycle again.
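The arithmetic is worth writing out once. This is a rough sketch that assumes traffic is spread evenly across the day, which real traffic is not; the allowance and request rate below are made-up numbers chosen to match the "couple of hours" shape, not the client's figures.

```python
def daily_outage(daily_allowance: int, requests_per_hour: float):
    """Return (hours served, share of the day refused) for a per-day cap.

    Assumes a constant request rate; real traffic is burstier.
    """
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    hours_served = min(24.0, daily_allowance / requests_per_hour)
    refused_share = (24.0 - hours_served) / 24.0
    return hours_served, refused_share


# Hypothetical numbers: 100 images per day against 50 requests per hour.
hours_served, refused_share = daily_outage(daily_allowance=100, requests_per_hour=50)
```

With those illustrative inputs the allowance lasts two hours and the feature is refused for the remaining twenty-two, a little over nine tenths of the day.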

This is worth sitting with, because it reframes the whole problem. The feature was not unreliable in the way the team had assumed — flickering on and off at random because AI is mysterious. It was a fixed quantity of capacity, provisioned too low, consumed early, and then absent for the rest of the day. The ceiling was the bug. And the ceiling had been that height since the feature launched, because nobody had sized it against real traffic in the first place — they had wired up the model on a starter plan to ship the feature, and the starter plan’s daily allowance had quietly been the real capacity of the feature ever since.

The signal nobody was watching

Here is the part that makes the year-long silence so striking: the error was never hidden. The route did the honest thing with the status code and passed the upstream’s 429 straight back. So the rate-limit errors were not buried inside some generic failure. They sat in the logs, plainly labelled as 429s, every single day, for the whole life of the feature.

The gap was on the watching side, not the recording side. The client’s previous observability system had no chart that broke this route out by response code, and no alert that fired when rate-limit errors appeared or climbed. A 429 sitting in a log file that no dashboard counts and no alert watches is, in operational terms, indistinguishable from no signal at all. The data existed; the attention did not. Anyone who had gone looking would have found the daily pattern in an afternoon — and nothing ever prompted anyone to go looking.

The signal was in the logs the whole time. The previous system recorded the real 429 faithfully and then did nothing with it — no chart broke the route out by status code, no alert fired on the daily 429s — so a true signal sat unread every day for a year.

This is the kind of failure that does not hide because the evidence is missing. The evidence was present and correct; it was simply unwatched — which is the more dangerous shape, because nothing about a faithfully recorded, uncharted error ever asks to be noticed.
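What "watching" would have meant is modest. A minimal sketch, assuming log records are dictionaries with `route` and `status` fields; those field names, the route path, and the threshold are all hypothetical and stand in for whatever the real log schema and alerting tool provide.

```python
from collections import Counter


def status_breakdown(records) -> Counter:
    """Count requests per (route, status code) pair."""
    return Counter((r["route"], r["status"]) for r in records)


def rate_limit_alert(breakdown: Counter, route: str, threshold: int) -> bool:
    """True when a route's 429 count reaches the threshold."""
    return breakdown[(route, 429)] >= threshold


# Hypothetical records for an hour of traffic.
records = [
    {"route": "/api/images", "status": 200},
    {"route": "/api/images", "status": 429},
    {"route": "/api/images", "status": 429},
    {"route": "/api/lessons", "status": 200},
]
breakdown = status_breakdown(records)
fire = rate_limit_alert(breakdown, "/api/images", threshold=2)
```

The point is not the code, which any dashboarding tool replaces; it is that the breakdown has to exist per route and per status code before a threshold has anything to stand on.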

Why nobody escalated

Put the two pieces together and the failure was both real and quietly evidenced: a capacity ceiling that emptied early every day, recorded faithfully in the logs, and watched by nothing. But neither piece fully explains the most striking part of the story, which is that the team had lived alongside this for the better part of a year and never escalated it. To explain that, you have to look at the human layer, not the stack.

The sociologist Diane Vaughan, studying the decisions that led to the Challenger launch, named the mechanism: the normalization of deviance. A practice that is clearly outside the bounds of safe or correct behavior becomes accepted as normal precisely because it keeps not causing a disaster. Each time the deviant thing happens without catastrophe, the boundary of “acceptable” quietly widens to include it. After enough repetitions, the deviation is the standard, and proposing to fix it sounds like proposing to fix something that is not broken.

An AI image feature erroring for most of every day is not a space shuttle. But the mechanism is the same, and it is sharpened by one detail specific to defects that ship on day one: there is no “before.” A regression has a healthy past to be compared against — a line that was flat and then stepped up, a week that was fine and then was not. That contrast is what makes a regression visible. A defect present from the first deploy has no such contrast. The broken level is the baseline. There was never a version of this feature that worked all day, so nothing in the data ever stepped, and the daily outage read as the feature’s normal personality rather than as a problem with a fix.

Why a day-one defect hides. Anomaly detection works by contrast with a baseline. A regression supplies that contrast for free. A defect that ships with the feature makes the broken level the baseline, and a flat line — however high — does not trip an alarm built to catch a change.
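The difference between the two kinds of alarm fits in a few lines. This is a deliberately simplified sketch of a change detector next to an absolute ceiling; the window, ratio, ceiling, and sample series are illustrative assumptions, and production anomaly detection is more elaborate than this.

```python
def step_change_alert(series, window: int = 3, ratio: float = 2.0) -> bool:
    """Fire when the recent mean exceeds `ratio` times the preceding mean."""
    if len(series) < 2 * window:
        return False
    previous = series[-2 * window:-window]
    recent = series[-window:]
    previous_mean = sum(previous) / window
    recent_mean = sum(recent) / window
    # The small floor avoids comparing against an exact zero baseline.
    return recent_mean > ratio * max(previous_mean, 1e-9)


def absolute_alert(series, ceiling: float) -> bool:
    """Fire when the latest value is above a fixed target, whatever came before."""
    return series[-1] > ceiling


# Hypothetical hourly failure rates.
regression = [0.01] * 6 + [0.30] * 3   # was fine, then stepped up
day_one_defect = [0.90] * 12           # broken since launch, perfectly flat

regression_caught = step_change_alert(regression)
defect_caught_by_change = step_change_alert(day_one_defect)
defect_caught_by_ceiling = absolute_alert(day_one_defect, ceiling=0.05)
```

The change detector catches the regression and stays silent on the flat line; only the check against a target, which somebody has to choose on purpose, notices the defect that was always there.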

None of the usual signals contradicted the “this is normal” story either. Uptime was green; the server was up and answering. The overall error rate carried a small, steady contribution from the route, well under any threshold worth setting. The thing that was actually wrong (the feature was unavailable for most of every day, from the moment the allowance ran out until the next day’s reset) was not a quantity anyone was watching, because no dashboard tracked the feature’s availability as its own number. It was folded into aggregates that looked fine.
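To make the contrast concrete, here is a sketch of the feature's availability as its own number next to the aggregate it was hiding in. The hourly buckets, the 50% cut-off for calling an hour "healthy", and the service-wide totals are all hypothetical; the definition of availability is an assumption, and a team would pick its own.

```python
def feature_availability(hourly, min_success_rate: float = 0.5):
    """Share of hours with traffic in which the feature mostly worked.

    hourly: list of (succeeded, failed) request counts, one pair per hour.
    Returns None when the feature saw no traffic at all.
    """
    active = [(ok, bad) for ok, bad in hourly if ok + bad > 0]
    if not active:
        return None
    healthy = sum(1 for ok, bad in active if ok / (ok + bad) >= min_success_rate)
    return healthy / len(active)


def aggregate_error_rate(total_requests: int, total_errors: int) -> float:
    """Service-wide error rate, the number most dashboards already show."""
    return total_errors / total_requests


# Hypothetical day: two good hours after the reset, then 22 hours of refusals.
hourly = [(10, 0)] * 2 + [(0, 5)] * 22
availability = feature_availability(hourly)

# Hypothetical service-wide totals that include those 110 image failures.
overall = aggregate_error_rate(total_requests=100_000, total_errors=110)
```

In this made-up day the feature is available for about 8% of its active hours while the service-wide error rate sits near 0.1%. Both numbers are true; only one of them describes what the user of the feature experienced.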

The churn nobody attributed to a bug

The cost did not show up where the team was looking. It showed up somewhere they were not connecting to a bug at all: in whether people came back.

Consider what a user hitting the feature during the outage experienced. They opened the feature, asked for an image, and got a short, generic error. They had no way to know the feature would work fine if they came back after the next day’s reset; the error did not say “try later,” it just said something failed. A reasonable person concludes the feature does not work, and a meaningful fraction of them never touch it again. That decision generates exactly one log line and zero alerts. It does not dent the error budget. It does not show up in latency. It is invisible to every operational metric, and it is the most expensive outcome in the whole story, because it is paid in users.

Figure: the two paths out of a silent failure. The branch operations can see, a log line, is harmless. The branch that costs the business, a user deciding the feature is broken, is the one nothing in the stack is built to count.

There is a quieter second-order effect worth naming. When a feature shows low adoption, the natural read is that users do not want it, and the natural response is to invest less in it. But “users do not want this” and “users cannot use this for most of the day” produce the same adoption chart while calling for opposite decisions. A team measuring only engagement, with no feature-level success metric and no alert on the errors quietly piling up in the logs, can talk itself into deprecating a feature that was simply never given enough capacity to work.

The fix, which was the easy part

Once the shape was clear, the remediation was ordinary engineering, and most of it took about a day.

The first move cost almost nothing: put the existing signal under watch. A chart that broke the route out by response code, and an alert that fired when rate-limit errors crossed a low threshold, turned a year-long blind spot into a same-day heads-up — the 429s had been there the whole time, waiting for something to count them. On the product side, the user-facing error was reworded so a rate-limit response reads as “the feature is busy, try again shortly” rather than a flat “something failed,” which turns a dead end into a recoverable state. The occasional empty-response failures got their own handling, so the two stopped being lumped together.
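Separating the two failure modes can be as small as a classification step in the route. The sketch below is an assumption about how that might look, not the client's code: the enum, the function, and the message strings are illustrative, with the rate-limit wording taken from the rewording described above.

```python
from enum import Enum
from typing import Optional


class ImageFailure(Enum):
    RATE_LIMITED = "rate_limited"
    EMPTY_RESPONSE = "empty_response"
    OTHER = "other"


def classify(status: int, image_bytes: Optional[bytes]) -> Optional[ImageFailure]:
    """Return the failure kind, or None when the request produced an image."""
    if status == 429:
        return ImageFailure.RATE_LIMITED
    if 200 <= status < 300:
        # The upstream answered, but may have come back with no image.
        return None if image_bytes else ImageFailure.EMPTY_RESPONSE
    return ImageFailure.OTHER


# Hypothetical user-facing copy, one message per failure kind.
USER_MESSAGES = {
    ImageFailure.RATE_LIMITED: "The feature is busy, try again shortly.",
    ImageFailure.EMPTY_RESPONSE: "No image came back for that prompt. Try rephrasing it.",
    ImageFailure.OTHER: "Something went wrong while generating the image.",
}

kind = classify(429, None)
message = USER_MESSAGES[kind]
```

Once each kind has its own label it can also have its own counter, which is what lets the rarer empty-response failures stop hiding behind the 429s.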

The second move was to treat the quota as capacity to be provisioned and watched, not a surprise to be discovered when it runs out. That means sizing the plan to real traffic instead of to whatever the feature shipped on, and alerting when daily usage crosses a fraction of the limit so there is time to act before users feel it. It is worth being precise about what does and does not help here. Client-side throttling spreads usage so the day’s allowance lasts longer, but it only delays exhaustion. Backoff and retry — the reflex for a rate-limit error — do nothing once a per-day cap is gone: there is nothing transient to recover, so a retry just earns another refusal until the next reset. That leaves the durable fix, which is a fallback. When the primary model is exhausted, route to a second provider, so the ceiling becomes a slight degradation rather than a wall. A feature that can fail over does not have a single daily cliff.
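The fallback logic described above can be sketched without any particular SDK. Everything here is hypothetical: the `QuotaExhausted` exception, the provider callables, and the stub functions stand in for whatever the real model clients raise and return. The one deliberate design choice is that an exhausted daily quota is never retried against the same provider; the code moves on to the next one instead.

```python
class QuotaExhausted(Exception):
    """Raised by a provider wrapper when its per-day allowance is gone."""


def generate_with_fallback(prompt: str, providers):
    """Try each (name, generate) pair in order; return (name, image bytes).

    A QuotaExhausted error is not retried: a daily cap will not recover
    until the next reset, so the only useful move is the next provider.
    """
    exhausted = []
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except QuotaExhausted:
            exhausted.append(name)
    raise QuotaExhausted("all providers exhausted: " + ", ".join(exhausted))


# Stub providers for illustration only.
def primary(prompt: str) -> bytes:
    raise QuotaExhausted("daily cap reached")


def secondary(prompt: str) -> bytes:
    return b"image-bytes-for:" + prompt.encode()


used, image = generate_with_fallback(
    "a fox reading a book",
    [("primary", primary), ("secondary", secondary)],
)
```

In a real route the provider wrappers would translate each vendor's quota error into the shared exception, and transient errors would keep their usual backoff; only the per-day exhaustion takes this path.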

None of that is novel. All of it was available to the team from the day they launched. What was missing was never the fix — it was the recognition that a fix was warranted, against a year of evidence that everything was normal.

Lessons

We see versions of this across teams shipping AI features on metered third-party APIs. Four worth writing down.

“It has always done that” is a reason to investigate, not a reason to relax. A baseline you have never compared against a target can be a defect wearing the costume of normal. Defects that ship on day one are the most dangerous precisely because they never produce the before-and-after contrast that anomaly detection — and human attention — is built to catch.
The failure that costs you customers may never page you. A user who hits an error and quietly leaves generates one log line and no alert; the cost lands in retention, not in the error budget. For anything user-facing, measure the feature’s own success rate from the user’s side, not just whether the server is up.
An error in your logs is not a signal until something watches it. Here the route returned the right status code all along; the previous system simply had no chart that broke the route out by response code and no alert that fired when rate-limit errors appeared. Data you record but never chart or alert on is, in practice, data you do not have. Put the number on a dashboard and a threshold on the number.
A per-day quota is capacity you provision and watch. A daily cap fails on a schedule locked to its reset clock: requests succeed until the day’s allowance drains, then every request is refused until the next reset. Throttling spreads usage so the allowance lasts longer, but only a bigger cap or a fallback provider keeps the feature up — backoff cannot recover a request once the day’s quota is gone. Size the quota to real traffic, alert well below the limit, and keep a fallback so a ceiling is a degradation instead of an outage.
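"Alert well below the limit" from the last lesson reduces to a comparison against the day's running count. A minimal sketch, assuming the application keeps its own tally of requests made since the last reset; the 50% and 80% thresholds and the sample numbers are illustrative, not recommendations.

```python
def quota_alert_level(used: int, daily_limit: int,
                      warn_at: float = 0.5, critical_at: float = 0.8) -> str:
    """Map today's usage against the daily cap to an alert level."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    fraction = used / daily_limit
    if fraction >= 1.0:
        return "exhausted"
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warn"
    return "ok"


# Hypothetical usage two hours after the reset.
level = quota_alert_level(used=85, daily_limit=100)
```

What matters is when the level changes: a "warn" two hours after the reset says the allowance is undersized long before the first user is refused.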

The fix here was a day of work: put the errors on a chart and an alert, raise and watch the quota, and add a fallback. The hard part, and what this story is really about, was noticing that a thing everyone had agreed to call normal was a bug — and that the feature nobody seemed to want was a feature nobody could reliably reach. When a system has behaved a certain way since the day it launched, the absence of complaints is not evidence that it is healthy. It may only be evidence that the people who hit the problem already left.
