Reducing On-Call Escalations to Dev Teams — Part 3

Evgeny Potapov, CEO, co-founder


This is Part 3 of our observability series.

There's plenty written about managing on-call stress for SREs. Much less attention goes to development teams, who bear the burden of incident escalations. This article covers ten practices that reduce escalations from on-call to dev teams.

1. Train Dev Teams on Observability

Problem: Observability is a wide area with many pitfalls. Without proper training, developers produce instrumentation too vague to be useful. On-call engineers can't diagnose issues and escalate to dev teams instead.

Approach: Train developers to build metrics, logs, and tracing into features as they write them. Allocate time for this in schedules — don't treat it as something to rush at the end.

Benefit: On-call engineers get actionable data, reducing escalations.
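To illustrate what that training buys you, here is a minimal sketch (the event and field names are hypothetical) contrasting a vague log line with one an on-call engineer can act on without paging the author:

```python
import json
import logging

logger = logging.getLogger("payments")

# Too vague to diagnose from — this is what untrained instrumentation looks like:
# logger.error("payment failed")

def log_payment_failure(order_id: str, provider: str,
                        error_code: str, latency_ms: float) -> str:
    """Emit a structured error carrying the context needed to diagnose
    the failure without escalating to the developer who wrote the code."""
    payload = {
        "event": "payment_failed",
        "order_id": order_id,
        "provider": provider,
        "error_code": error_code,
        "latency_ms": latency_ms,
    }
    line = json.dumps(payload)
    logger.error(line)
    return line

line = log_payment_failure("ord-42", "stripe", "card_declined", 118.0)
```

The structured form is grep-able and queryable: on-call can immediately see which provider failed and why, rather than asking the dev team what "payment failed" means.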

2. Analyze Failure Modes During Development

Problem: Without dedicated time to think about how a service can fail, teams are unprepared when it does. The result: longer downtime and more people pulled in.

Approach: During development, identify specific failure scenarios, the alerts they'd trigger, and the steps to diagnose and resolve them. Create or update runbooks.

Benefit: On-call teams respond faster and more independently.
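One way to make this analysis durable is to keep it machine-readable next to the service code, so on-call tooling can map a firing alert back to its runbook. A hypothetical sketch (all scenario, alert, and path names are invented for illustration):

```python
# Failure-mode inventory written during development, not after an outage.
# Each entry links a scenario to the alert it should trigger, the runbook
# to follow, and the first diagnostic steps.
FAILURE_MODES = {
    "db_connection_pool_exhausted": {
        "alert": "HighDBConnectionWaitTime",
        "runbook": "runbooks/orders/db-pool.md",
        "first_steps": ["check pool metrics", "look for long-running transactions"],
    },
    "upstream_payment_timeout": {
        "alert": "PaymentProviderLatencyHigh",
        "runbook": "runbooks/orders/payment-timeouts.md",
        "first_steps": ["check provider status page", "inspect retry behavior"],
    },
}

def runbook_for(alert_name: str) -> str:
    """Resolve a firing alert to its runbook; fall back to the escalation doc."""
    for mode in FAILURE_MODES.values():
        if mode["alert"] == alert_name:
            return mode["runbook"]
    return "runbooks/escalate.md"
```

Because the mapping lives in the repository, it gets reviewed and updated alongside the code that creates the failure modes.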

3. Make Deployments Revertible

Problem: If a deployment can't be reverted, fixing production issues requires finding the developer who wrote the code. Delays pile up while the issue persists.

Approach: Plan database and application changes so they can be rolled back without breaking the system. For schema changes, run new and old structures in parallel and maintain backward compatibility during a transition period.

Benefit: On-call teams revert and restore service quickly. Developers investigate without production pressure.
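The parallel-structures approach is often called expand/contract. A minimal sketch using SQLite (table and column names are hypothetical): the "expand" step adds a new column and backfills it while the old one keeps working, so reverting the application deploy never breaks the schema.

```python
import sqlite3

# Migrating "amount" (dollars, float) to "amount_cents" (integer) without
# breaking old readers or writers during the transition period.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders (amount) VALUES (12.5)")

# Expand: add the new column alongside the old one, then backfill.
conn.execute("ALTER TABLE orders ADD COLUMN amount_cents INTEGER")
conn.execute("UPDATE orders SET amount_cents = CAST(amount * 100 AS INTEGER)")

# During the transition, new code writes both columns, so a rollback to the
# old application version still finds the data it expects.
conn.execute("INSERT INTO orders (amount, amount_cents) VALUES (?, ?)", (3.0, 300))

rows = conn.execute(
    "SELECT amount, amount_cents FROM orders ORDER BY id"
).fetchall()

# Contract: only after every reader has migrated do you drop "amount" —
# in a later, separate deployment.
```

The key property is that at every point in the transition, both the previous and the current application version run correctly against the live schema.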

4. Standardize Observability Across Teams

Problem: When each team uses different metrics formats, log structures, and monitoring approaches, on-call engineers waste time understanding each service's setup before they can diagnose anything.

Approach: Establish and enforce standards for tracing, metrics, and logs across all teams. Require compliance when developing or modifying features.

Benefit: Faster issue identification. On-call teams work independently across services.
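A concrete form such a standard can take is a shared log formatter that every service imports, so field names are identical across teams. A sketch using Python's standard `logging` module (the class name and field set are assumptions, not an existing library):

```python
import json
import logging

class StandardJSONFormatter(logging.Formatter):
    """Hypothetical company-wide formatter: every team's logs share the same
    field names, so on-call engineers query any service the same way."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": getattr(record, "service", "unknown"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(StandardJSONFormatter())

# Simulate a record as a service would emit it.
record = logging.LogRecord("checkout", logging.WARNING, __file__, 1,
                           "cart total mismatch", None, None)
record.service = "checkout"
formatted = handler.format(record)
```

Shipping the formatter (and equivalent metric/trace conventions) as an internal package makes compliance the path of least resistance rather than a review-time argument.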

5. Define Clear Escalation Paths

Problem: Ambiguous escalation procedures cause delays and miscommunication during incidents.

Approach: Document who to contact and when for each service. Keep contact information, roles, and responsibilities accessible.

Benefit: Faster escalation when needed. No more calling people uninvolved with a given service.
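Escalation paths work best when they are data, not tribal knowledge. A hypothetical sketch (service and role names invented) of a machine-readable map that paging tooling could consult:

```python
# Per-service escalation chain, ordered from first responder to last resort.
ESCALATION_PATHS = {
    "checkout": ["oncall-sre", "team-checkout", "eng-manager-payments"],
    "search":   ["oncall-sre", "team-search"],
}

def next_contact(service: str, current_level: int):
    """Return the next role to page for a service, or None when the
    path is exhausted or the service has no documented path."""
    path = ESCALATION_PATHS.get(service, [])
    if 0 <= current_level + 1 < len(path):
        return path[current_level + 1]
    return None
```

A lookup like this also makes gaps visible: a service missing from the map is itself an actionable finding, caught before an incident rather than during one.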

6. Require Knowledge Sharing from Developers

Problem: On-call engineers who don't understand new or updated services will escalate to the developers who built them.

Approach: Developers regularly share knowledge about service architecture, interactions with other services, and database dependencies — through workshops, documentation, or regular meetings.

Benefit: On-call teams handle incidents with new services without pulling in developers.

7. Maintain Runbooks

Problem: On-call engineers escalate because they don't know how to handle specific issues.

Approach: Create detailed runbooks in collaboration with dev teams, covering common issues for each service. Review and update them regularly.

Benefit: On-call engineers resolve issues independently, reducing escalation frequency and response times.

8. Mirror Production in Pre-Production

Problem: Staging environments often contain in-progress features, creating differences from production. Features tested in such environments may break in production.

Approach: Maintain a pre-production environment that mirrors production in database structure and ideally data. Test deployments under production-like conditions.

Benefit: Fewer bugs reach production, reducing urgent reverts and fixes.

9. Deploy Smaller Changes More Often

Problem: Large deployments increase failure risk and make troubleshooting harder.

Approach: Reduce the scope of changes per deployment. Deploy more frequently with smaller changesets.

Benefit: When something breaks, there's a smaller set of changes to investigate.

10. Give Developers Recovery Time After Incidents

Problem: Developers called in for incidents are often expected to return to normal work immediately. Those called at night or on weekends don't get adequate time off.

Approach: Allocate recovery time after incident response. For daytime incidents, provide some downtime. For nights and weekends, offer a couple of days off.

Benefit: Prevents burnout and turnover. Developers return rested and productive.
