Reducing On-Call Escalations to Dev Teams — Part 3

Evgeny Potapov, CEO and co-founder

This is Part 3 of our observability series.

There's plenty written about managing on-call stress for SREs. Much less attention goes to the development teams who bear the burden of incident escalations. This article covers ten practices that reduce escalations from on-call engineers to dev teams.

1. Train Dev Teams on Observability

Problem: Observability is a wide area with many pitfalls. Without proper training, developers produce instrumentation too vague to be useful. On-call engineers can't diagnose issues and escalate to dev teams instead.

Approach: Train developers to build metrics, logs, and tracing into features as they write them. Allocate time for this in schedules — don't treat it as something to rush at the end.

Benefit: On-call engineers get actionable data, reducing escalations.
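As a rough illustration of building instrumentation into a feature as it is written, here is a minimal sketch using only the Python standard library. The service name, the `instrumented` helper, the `METRICS` counter, and `charge_card` are all hypothetical. A real service would send counters and durations to whatever metrics and tracing backend the team already uses.

```python
import json
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("checkout")  # hypothetical service name

# Assumption: an in-process counter stands in for a real metrics client.
METRICS = Counter()


@contextmanager
def instrumented(operation, **context):
    """Record outcome, duration, and business context for one unit of work."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception as exc:
        outcome = "error"
        context["error_type"] = type(exc).__name__
        raise  # instrumentation must not swallow the failure
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        METRICS[f"{operation}.{outcome}"] += 1
        # One structured line per operation, with the identifiers an
        # on-call engineer needs to search for.
        logger.info(json.dumps({
            "operation": operation,
            "outcome": outcome,
            "duration_ms": round(duration_ms, 2),
            **context,
        }))


def charge_card(order_id, amount_cents):
    """Hypothetical feature code, instrumented as it is written."""
    with instrumented("charge_card", order_id=order_id, amount_cents=amount_cents):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return "charged"
```

The point of the sketch is that the log line carries the operation name, the outcome, and the order ID, so an on-call engineer can tell which operation failed and for which order without reading the code.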

2. Analyze Failure Modes During Development

Problem: Without dedicated time to think about how a service can fail, teams are unprepared when it does. The result is longer downtime and more people pulled into each incident.

Approach: During development, identify specific failure scenarios, the alerts they'd trigger, and the steps to diagnose and resolve them. Create or update runbooks.

Benefit: On-call teams respond faster and more independently.
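One way to make this analysis concrete is to keep the failure scenarios as structured data next to the code, so that missing alerts or runbook steps show up in review. The sketch below is only an assumption about how such a catalog could look. The `FailureMode` fields, the alert name, and the example scenarios are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """One way the service can fail, and how on-call would notice and fix it."""
    scenario: str
    alert: str = ""
    diagnosis_steps: list = field(default_factory=list)
    resolution_steps: list = field(default_factory=list)

    def gaps(self):
        """Return the parts of this entry that are still empty."""
        missing = []
        if not self.alert:
            missing.append("alert")
        if not self.diagnosis_steps:
            missing.append("diagnosis_steps")
        if not self.resolution_steps:
            missing.append("resolution_steps")
        return missing


def review_catalog(catalog):
    """Map each incomplete scenario to the fields it still needs."""
    return {fm.scenario: fm.gaps() for fm in catalog if fm.gaps()}


# Hypothetical catalog for an imaginary payments service.
CATALOG = [
    FailureMode(
        scenario="Database connection pool exhausted",
        alert="db_pool_in_use_ratio > 0.9 for 5m",
        diagnosis_steps=["Check pool metrics", "Look for slow queries"],
        resolution_steps=["Restart stuck workers", "Raise pool size temporarily"],
    ),
    FailureMode(
        scenario="Payment provider times out",
        alert="provider_timeout_rate > 0.05 for 10m",
    ),
]
```

Calling `review_catalog(CATALOG)` here reports that the provider-timeout scenario still lacks diagnosis and resolution steps, which is the kind of gap worth closing before release.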

3. Make Deployments Revertible

Problem: If a deployment can't be reverted, fixing production issues requires finding the developer who wrote the code. Delays pile up while the issue persists.

Approach: Plan database and application changes so they can be rolled back without breaking the system. For schema changes, run new and old structures in parallel and maintain backward compatibility during a transition period.

Benefit: On-call teams revert and restore service quickly. Developers investigate without production pressure.
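The parallel-structures idea can be sketched with an in-memory SQLite database. This assumes a hypothetical rename of a `fullname` column to `display_name`. The table, the columns, and the helper functions are illustrative, and a production migration would go through the team's own migration tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing schema used by the currently deployed version.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Step 1 (expand): add the new column alongside the old one.
# It is nullable, so the old version can keep inserting rows.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: backfill existing rows into the new column.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def create_user(conn, name):
    """New version: writes BOTH columns during the transition period."""
    cur = conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)",
        (name, name),
    )
    return cur.lastrowid


def old_version_read(conn, user_id):
    """What the previous release does; it must keep working after a rollback."""
    row = conn.execute(
        "SELECT fullname FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


def new_version_read(conn, user_id):
    """What the new release does."""
    row = conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]
```

Because the new version writes both columns, rows it creates remain readable by the old version, so on-call can revert the application without touching the database. The old column is dropped only in a later release, once a rollback to the old code is no longer expected.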

4. Standardize Observability Across Teams

Problem: When each team uses different metrics formats, log structures, and monitoring approaches, on-call engineers waste time understanding each service's setup before they can diagnose anything.

Approach: Establish and enforce standards for tracing, metrics, and logs across all teams. Require compliance when developing or modifying features.

Benefit: Faster issue identification. On-call teams work independently across services.
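Enforcement is easier when the standard is something a CI job can check. Below is a minimal sketch of a validator for a shared JSON log format. The required fields and allowed levels are assumptions chosen for the example, not a recommended standard.

```python
import json

# Assumption: an example cross-team standard. Each organization defines its own.
REQUIRED_FIELDS = {
    "timestamp": str,
    "service": str,
    "level": str,
    "message": str,
    "trace_id": str,
}
ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_log_line(line):
    """Return a list of problems; an empty list means the line is compliant."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(record, dict):
        return ["not a JSON object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")

    level = record.get("level")
    if isinstance(level, str) and level not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {level}")
    return problems
```

A check like this could run against sample log output in each service's test suite, so a non-compliant format is caught before on-call ever has to read it.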

5. Define Clear Escalation Paths

Problem: Ambiguous escalation procedures cause delays and miscommunication during incidents.

Approach: Document who to contact and when for each service. Keep contact information, roles, and responsibilities accessible.

Benefit: Faster escalation when needed. No more calling people uninvolved with a given service.
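An escalation path can also be kept as data that tooling and people read from the same place. The sketch below assumes a simple model in which each step becomes active after a number of minutes without resolution. The service name, contacts, channels, and thresholds are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    channel: str


# Hypothetical paths: (minutes unresolved before this step applies, contact).
ESCALATION_PATHS = {
    "payments-api": [
        (0, Contact("On-call SRE", "first responder", "pager")),
        (15, Contact("Payments dev on duty", "service owner", "pager")),
        (45, Contact("Payments engineering manager", "incident lead", "phone")),
    ],
}


def who_to_contact(service, minutes_unresolved):
    """Return every contact who should be involved by now, in escalation order."""
    path = ESCALATION_PATHS.get(service)
    if path is None:
        # A missing path is itself a finding: the service is undocumented.
        raise KeyError(f"no escalation path documented for {service!r}")
    return [contact for after, contact in path if minutes_unresolved >= after]
```

With this data, `who_to_contact("payments-api", 20)` returns the on-call SRE and the payments developer on duty, but not yet the engineering manager.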

6. Require Knowledge Sharing from Developers

Problem: On-call engineers who don't understand new or updated services will escalate to the developers who built them.

Approach: Developers regularly share knowledge about service architecture, interactions with other services, and database dependencies — through workshops, documentation, or regular meetings.

Benefit: On-call teams handle incidents with new services without pulling in developers.

7. Maintain Runbooks

Problem: On-call engineers escalate because they don't know how to handle specific issues.

Approach: Create detailed runbooks in collaboration with dev teams, covering common issues for each service. Review and update them regularly.

Benefit: On-call engineers resolve issues independently, reducing escalation frequency and response times.

8. Mirror Production in Pre-Production

Problem: Staging environments often contain in-progress features, creating differences from production. Features tested in such environments may break in production.

Approach: Maintain a pre-production environment that mirrors production in database structure and ideally data. Test deployments under production-like conditions.

Benefit: Fewer bugs reach production, reducing urgent reverts and fixes.
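Drift in database structure between the two environments can be detected automatically. The following is a sketch under the assumption that both databases are SQLite, which keeps the example self-contained. Other engines expose the same information through their own catalogs, and the function names here are hypothetical.

```python
import sqlite3


def schema_of(conn):
    """Return {table: {(column_name, declared_type), ...}} for user tables."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        schema[table] = {(col[1], col[2]) for col in columns}
    return schema


def schema_drift(prod, preprod):
    """List human-readable differences between two database schemas."""
    prod_schema, pre_schema = schema_of(prod), schema_of(preprod)
    drift = []
    for table in sorted(set(prod_schema) | set(pre_schema)):
        if table not in pre_schema:
            drift.append(f"{table}: missing in pre-production")
        elif table not in prod_schema:
            drift.append(f"{table}: only in pre-production")
        elif prod_schema[table] != pre_schema[table]:
            differing = prod_schema[table] ^ pre_schema[table]
            names = sorted({name for name, _ in differing})
            drift.append(f"{table}: column mismatch ({', '.join(names)})")
    return drift
```

Run on a schedule, a check like this flags in-progress changes that have leaked into pre-production before they invalidate a test run.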

9. Deploy Smaller Changes More Often

Problem: Large deployments increase failure risk and make troubleshooting harder.

Approach: Reduce the scope of changes per deployment. Deploy more frequently with smaller changesets.

Benefit: When something breaks, there's a smaller set of changes to investigate.

10. Give Developers Recovery Time After Incidents

Problem: Developers called in for incidents are often expected to return to normal work immediately. Those called at night or on weekends don't get adequate time off.

Approach: Allocate recovery time after incident response. For daytime incidents, give developers a break before they return to planned work. For nights and weekends, offer a couple of days off.

Benefit: Prevents burnout and turnover. Developers return rested and productive.
