On-call stress for site reliability engineers is a well-traveled topic. The other side of the same arrangement gets less attention: the developers who get escalated to. Each escalation costs more than an hour of a developer's time. It is a context switch off the feature they were building, an interrupted train of thought that takes the rest of the morning to recover from, and, for the after-hours ones, a hit to next week's productivity. Multiplied across a team, the rate of escalations is one of the larger uncosted line items in the engineering budget.
This is Part 3 of a series on building an observability strategy. Part 1 walked through the layers of what to monitor and why. Part 2 argued that observability is code, not tooling, and roughly a third of the codebase in a well-instrumented system. This part is about what changes about the on-call side once those first two are in place, and the practices that turn an on-call shift into a job the on-call engineer can do without paging a developer for most of it.
The practices below are not new. Several of them are folklore in mature operations teams. What is worth saying about them as a set is that they reduce escalations most when adopted together: each one closes a specific path by which an incident becomes a 2 a.m. phone call, and the paths are not redundant. A reader who already does seven of the ten gets some benefit from each individually and outsized benefit from closing the last three.
Instrumentation an on-call engineer can read
The first step of an incident response is reading the dashboard. If the dashboard does not answer the obvious question — what is broken, by how much, since when — the on-call engineer has nothing to do at that step except page the developer. The work that prevents this happens during development, long before the incident.
The practical version is two practices held together. Train developers to write metrics, logs, and traces that an operator can act on without context from the developer, and standardize how they are written across teams so the operator does not have to learn each service's idioms before reading them. Part 2 walked through what those standards cover (naming, label cardinality, span boundaries, log structure) and why they are worth treating as engineering rather than a final-day cleanup. This part's frame is the consequence: the standards exist for the person at 2 a.m. who did not write the code, and the only test of whether they are good enough is whether that person can read them.
The failure mode in teams that skip this is recognizable. The metric exists, but its label is the literal name of the field rather than something an outsider would understand. The log line records what happened but not which request it happened on. The trace shows that something was slow but not which operation, because the span was scoped around the whole handler rather than the slow call inside it. None of these break the service; all of them push the on-call engineer toward a page.
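Two of the failure modes above (a log line with no request identifier, and a span scoped around the whole handler) can be illustrated with a minimal sketch. It uses only the Python standard library rather than a real tracing SDK, and the service name, operation names, and the `request_id` field are assumptions made up for the example.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("checkout")  # hypothetical service name


def log_event(event: str, **fields) -> str:
    """Emit one JSON log line carrying the fields an operator filters on."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


@contextmanager
def timed_operation(name: str, request_id: str, timings: dict):
    """Time one named operation so a slow call shows up under its own name
    instead of disappearing inside a measurement of the whole handler."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[name] = elapsed_ms
        # The request_id ties this line to one specific request.
        log_event("operation_finished", operation=name,
                  request_id=request_id, elapsed_ms=round(elapsed_ms, 2))


def handle_request(request_id: str) -> dict:
    """Hypothetical handler: each inner call gets its own timed scope."""
    timings: dict = {}
    with timed_operation("load_cart", request_id, timings):
        time.sleep(0.001)  # stand-in for a fast local call
    with timed_operation("charge_payment_provider", request_id, timings):
        time.sleep(0.05)   # stand-in for the slow external call
    return timings
```

With per-operation scopes, the on-call engineer can see which call was slow and on which request, without knowing how the handler is written.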
Runbooks and knowledge transfer
Assuming the on-call engineer can read the dashboard, the next question is whether they know what to do about what they see. Two practices answer that question, and a third turns the answers into a durable artifact.
The first is failure-mode analysis during development. The exercise is simple to describe — for each new service or significant change, enumerate the specific ways it can fail, the alerts each failure would trigger, and the diagnosis and resolution steps for each — and most teams do not do it, because the time it takes is invisible to anyone who is not on call when the failure happens. Allocating an hour of design time to this question consistently saves a multiple of that in escalation time over the service's life. The output is not a document for its own sake; it is the raw material for the runbook.
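One way to keep the output of that exercise usable as runbook raw material is to record it as structured data. The sketch below is an assumption about what such a record could look like; the field names and the plain-text rendering are illustrative, not a standard format.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    alert: str              # the alert this failure would trigger
    diagnosis: list[str]    # steps to confirm this is what is happening
    resolution: list[str]   # steps to resolve it


def incomplete_modes(modes: list[FailureMode]) -> list[str]:
    """Names of failure modes still missing an alert, diagnosis, or resolution."""
    return [m.name for m in modes
            if not (m.alert and m.diagnosis and m.resolution)]


def render_runbook(service: str, modes: list[FailureMode]) -> str:
    """Render the analysis as a plain-text runbook skeleton."""
    lines = [f"Runbook: {service}"]
    for m in modes:
        lines += ["", f"Failure mode: {m.name}", f"Alert: {m.alert}", "Diagnosis:"]
        lines += [f"  {i}. {step}" for i, step in enumerate(m.diagnosis, 1)]
        lines.append("Resolution:")
        lines += [f"  {i}. {step}" for i, step in enumerate(m.resolution, 1)]
    return "\n".join(lines)
```

A check like `incomplete_modes` could run in design review, so a failure mode with no resolution steps is caught before the service ships rather than during an incident.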
The second is knowledge sharing from the developers who built the service to the engineers who will be on call for it. The shape that works is durable — a written walkthrough of the service's architecture, its dependencies, its known failure modes — supplemented by a one-time live session where the on-call team can ask the questions the document did not anticipate. The shape that does not work is implicit: assuming the on-call team will pick it up by reading the code or attending standups. They will not, and the escalation rate is the measurement of that gap.
The third is the runbook itself: the artifact where the failure modes, the dashboards that confirm each one, and the steps to resolve them live in one place. Runbooks have a maintenance problem: they go out of date faster than anyone expects, because the system they describe keeps changing. The discipline is to update the runbook in the same diff that changes the service, the same way the changelog gets updated and under the same review pressure. A runbook that is wrong is worse than a runbook that is missing, because the on-call engineer follows it and ends up further from a resolution than where they started.
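The same-diff discipline can be nudged by a small review check. The sketch below assumes a hypothetical repository layout in which service code lives under `services/<name>/` and each runbook is `runbooks/<name>.md`; both paths are illustrative. Not every service change needs a runbook change, so this is probably better as a reviewer prompt than as a hard gate.

```python
from pathlib import PurePosixPath


def services_missing_runbook_update(changed_paths: list[str]) -> list[str]:
    """Given the paths touched by a diff, return the services whose code
    changed without a matching change to runbooks/<service>.md."""
    changed_services = set()
    updated_runbooks = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) >= 3 and parts[0] == "services":
            changed_services.add(parts[1])
        elif len(parts) == 2 and parts[0] == "runbooks" and parts[1].endswith(".md"):
            updated_runbooks.add(parts[1][: -len(".md")])
    return sorted(changed_services - updated_runbooks)
```

For a diff touching `services/billing/app.py`, `services/search/index.py`, and `runbooks/billing.md`, the function returns `["search"]`.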
Reversible, small deploys
If the dashboard and the runbook do not resolve the incident, the next move is usually to undo whatever changed most recently. Three practices make that move available, and the absence of any of them turns a one-line rollback into a page to the developer who wrote the change.
Make deployments revertible. Database schema changes are the place this most often fails: a column is dropped, a constraint is added, a foreign key is repointed, and rolling back the application no longer works because the schema it expects no longer exists. The discipline is to plan schema changes in stages: the new shape lands first while the old one still works, then the application is updated to use the new shape, and only later (a release or two later) is the old shape removed. The cost is a little extra schema work and a slightly slower migration. The benefit is that any deploy in between can be rolled back in seconds, because the schema the previous version expects is still there.
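The staged approach can be sketched with an in-memory SQLite database. The table, the column rename from `name` to `full_name`, and the dual-write in the application are assumptions chosen for illustration. The final stage rebuilds the table rather than dropping the column in place, so the sketch does not depend on a recent SQLite version.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Hypothetical starting schema: users(id, name)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
    return conn


def expand(conn: sqlite3.Connection) -> None:
    """Stage 1: add the new column next to the old one and backfill it."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def write_name(conn: sqlite3.Connection, user_id: int, value: str) -> None:
    """Stage 2 application code: write both columns, so rolling back to the
    previous release (which reads `name`) still sees current data."""
    conn.execute("UPDATE users SET name = ?, full_name = ? WHERE id = ?",
                 (value, value, user_id))


def read_name_old(conn: sqlite3.Connection, user_id: int) -> str:
    """What the previous release does."""
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()[0]


def read_name_new(conn: sqlite3.Connection, user_id: int) -> str:
    """What the new release does."""
    return conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,)).fetchone()[0]


def contract(conn: sqlite3.Connection) -> None:
    """Stage 3, a release or two later: remove the old column by rebuilding
    the table. After this, rolling back to the old release no longer works."""
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")
```

Between `expand` and `contract`, both `read_name_old` and `read_name_new` succeed, which is what makes an application rollback safe during that window. Once `contract` has run, `read_name_old` raises `sqlite3.OperationalError`.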
Mirror production in pre-production. A pre-production environment whose database is empty or whose data shape is different from production catches none of the bugs that depend on production-shaped data. The fix is to keep pre-prod close enough to prod — scrubbed of sensitive fields where required, but close in schema and in the cardinality of the data — that a deploy passing pre-prod is a meaningful signal. The fully accurate version of this is expensive; the version that catches most of what matters is to mirror schema and reasonable data shapes, not the byte-for-byte production database.
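Scrubbing sensitive fields while keeping the data's shape can be sketched as deterministic pseudonymization: the same input always maps to the same token, so the number of distinct values and the join behavior survive. The field names and salt below are hypothetical, and hashing alone is not strong anonymization for low-entropy values such as phone numbers, so treat this as an illustration of preserving cardinality, not as a privacy recipe.

```python
import hashlib

SENSITIVE_FIELDS = frozenset({"email", "phone"})  # hypothetical field names


def scrub_value(field: str, value: str, salt: str = "preprod") -> str:
    """Replace a sensitive value with a stable token derived from it."""
    digest = hashlib.sha256(f"{salt}:{field}:{value}".encode("utf-8")).hexdigest()
    return f"{field}_{digest[:12]}"


def scrub_rows(rows: list[dict], sensitive=SENSITIVE_FIELDS) -> list[dict]:
    """Scrub sensitive fields in each row; keep keys, NULLs, and other values."""
    return [
        {
            key: scrub_value(key, str(val)) if key in sensitive and val is not None else val
            for key, val in row.items()
        }
        for row in rows
    ]
```

Because identical inputs produce identical tokens, a column with many duplicate emails in production still has many duplicates in pre-production, and NULLs stay NULL.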
Deploy smaller changes, more often. Two changes shipped together are harder to diagnose when one of them breaks than the same two changes shipped separately. The arithmetic compounds: ten changes in one deploy turn an investigation into a search problem. The shape that helps is small, frequent deploys with the rollback discipline above; when something breaks, there is one change to suspect, and the rollback brings the system back to a known state.
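The arithmetic can be made concrete under simplifying assumptions: exactly one change in the deploy is at fault, and the changes can be applied as an ordered sequence and tested at any cut point. Bisection then needs a number of test rounds that grows with the logarithm of the batch size. If the fault may come from two changes interacting, the number of pairs to consider grows quadratically.

```python
def worst_case_bisect_rounds(n_changes: int) -> int:
    """Worst-case test rounds to isolate one bad change among n by bisection,
    i.e. ceil(log2(n)), computed with integer arithmetic."""
    if n_changes < 1:
        raise ValueError("a deploy contains at least one change")
    return (n_changes - 1).bit_length()


def pairwise_interactions(n_changes: int) -> int:
    """Number of distinct pairs of changes that could be interacting."""
    return n_changes * (n_changes - 1) // 2


# One change per deploy: 0 rounds of searching, 0 pairs to consider.
# Ten changes per deploy: up to 4 rounds, and 45 possible pairs.
```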
When escalation is still needed, and what comes after
No set of practices reduces the escalation rate to zero. The remaining incidents are the ones where the cause is genuinely in the developer's head, and the only path to resolution runs through paging them. Two practices apply to those.
Define clear escalation paths. The on-call engineer should not be guessing who owns which service or which developer is responsible for which subsystem; the directory should be one document, kept current, with names and contact methods and the responsibilities each name covers. The failure mode here is calling the wrong person at 3 a.m. and waiting an hour while they figure out it is not their service. The cost of getting this right is the half-day it takes to write the document and the discipline of updating it when teams reorganize.
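The directory can be kept as data as well as prose, which makes gaps checkable. The sketch below is one possible shape; the owner names, contact strings, and service names are placeholders, not a recommended schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    name: str
    contact: str              # how to reach them during an incident
    covers: tuple[str, ...]   # services this person is responsible for


# Hypothetical entries for illustration only.
DIRECTORY = [
    Owner("Example Developer A", "pager: team-billing", ("billing-api", "invoice-worker")),
    Owner("Example Developer B", "pager: team-search", ("search-api",)),
]


def who_to_page(service: str, directory: list[Owner] = DIRECTORY) -> list[Owner]:
    """Owners responsible for a service, so on-call does not have to guess."""
    return [owner for owner in directory if service in owner.covers]


def unowned_services(services: list[str], directory: list[Owner] = DIRECTORY) -> list[str]:
    """Services with no owner listed: the gaps to fix before 3 a.m."""
    covered = {svc for owner in directory for svc in owner.covers}
    return sorted(set(services) - covered)
```

Running `unowned_services` against the list of deployed services after each reorganization is one way to keep the document current.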
Give developers recovery time after they have been pulled in. A developer who responds to an incident at 2 a.m. and is then expected at standup at 9 a.m. on three hours of sleep is not a productive developer that day or the next one. The fair allocation is straightforward: incidents during business hours get some compensatory downtime in the same week; night and weekend incidents get a couple of days off. Teams that do not allow this discover, eventually, that the developers who are willing to be on call are the ones leaving, and the remaining team has all the on-call duty and no slack to spare.
What this kind of strategy tends to look like
Reading the practices back, the escalation rate is the visible metric a strategy of this shape produces. It is also a lagging indicator: a team that adopts these practices does not see the rate drop in the first month, because that month's incidents mostly fall into the runbook gap, the dashboard gap, and the deploy-discipline gap, all of which take time to close. The drop arrives over quarters, as each practice closes one of the paths the previous month's incidents were taking. The teams that measure the rate and watch it move are the ones that recognize the pattern and stay with the practices long enough for the benefit to compound.
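Measuring the rate does not need much tooling. A minimal sketch, assuming each incident record carries a date and a flag for whether a developer was paged (the record shape is an assumption, not taken from any particular incident tracker):

```python
from collections import defaultdict
from datetime import date


def escalation_rate_by_month(incidents: list[tuple[date, bool]]) -> dict[str, float]:
    """Fraction of incidents per month that were escalated to a developer.

    Each incident is (day it occurred, whether it was escalated).
    Months with no incidents are simply absent from the result.
    """
    totals: dict[str, int] = defaultdict(int)
    escalated: dict[str, int] = defaultdict(int)
    for day, was_escalated in incidents:
        month = day.strftime("%Y-%m")
        totals[month] += 1
        if was_escalated:
            escalated[month] += 1
    return {month: escalated[month] / totals[month] for month in sorted(totals)}
```

Because the indicator lags, the useful view is the trend over several quarters rather than any single month's value.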
The pattern across the three parts of this series is, broadly, design, build, sustain. Part 1 was the layers an observability strategy is supposed to cover — what each one is for and where it fits in the descent from user-facing signals to infrastructure. Part 2 was what it costs to build those layers as real engineering work — 30% of the code, 28% of the development time, with the same disciplines as the rest of the codebase. Part 3 has been about what happens after the layers exist and the code is in production: the practices that keep the on-call rotation a job the on-call engineer can do, and that keep the developers who built the system from having their week absorbed by it every time something goes wrong.
Observability, read across the three parts, is not a tool you buy. It is a strategy you adopt, an engineering practice that takes real time to build, and a set of operational habits that turn the cost of that practice into reliability the rest of the team can count on.