
Claude Code Workshop & Best Practices Speaker: Evgeny Potapov, ApexData co-founder & CEO

This is Part 3 of our observability series. Read the previous articles in the series.
There's plenty written about managing on-call stress for SRE engineers. Much less attention goes to development teams, who bear the burden of incident escalations. This article covers ten practices that reduce escalations from on-call to dev teams.
Problem: Observability is a broad field with many pitfalls. Without proper training, developers produce instrumentation too vague to be useful. On-call engineers can't diagnose issues and escalate to dev teams instead.
Approach: Train developers to build metrics, logs, and tracing into features as they write them. Allocate time for this in schedules — don't treat it as something to rush at the end.
Benefit: On-call engineers get actionable data, reducing escalations.
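As a minimal sketch of what "building instrumentation in as you write the feature" can look like, here is a hypothetical order-processing function instrumented with only the Python standard library. The function name, metric name, and log fields are illustrative assumptions; real projects would more likely use OpenTelemetry or a metrics client.

```python
import json
import logging
import time

logger = logging.getLogger("orders")

def process_order(order_id: str, metrics: dict) -> bool:
    """Process an order, emitting a structured log line and a latency metric."""
    start = time.monotonic()
    ok = False
    try:
        # ... business logic would go here ...
        ok = True
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        # Record latency under a named metric so it can be graphed and alerted on.
        metrics.setdefault("order_process_ms", []).append(elapsed_ms)
        # Structured, machine-parseable log entry an on-call engineer can query.
        logger.info(json.dumps({
            "event": "order_processed",
            "order_id": order_id,
            "success": ok,
            "elapsed_ms": round(elapsed_ms, 2),
        }))
    return ok

metrics: dict = {}
result = process_order("ord-123", metrics)
```

The point is that the log entry and the metric are written in the same change as the business logic, not bolted on later.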
Problem: Without dedicated time to think about how a service can fail, teams are unprepared when it does. Longer downtime, more people pulled in.
Approach: During development, identify specific failure scenarios, the alerts they'd trigger, and the steps to diagnose and resolve them. Create or update runbooks.
Benefit: On-call teams respond faster and more independently.
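One way to make a failure scenario identified during development concrete is to encode it as an alert rule that links directly to its runbook. The sketch below is a hypothetical, framework-free version of such a rule; the threshold, alert name, and runbook path are all illustrative assumptions, not a real alerting system's configuration.

```python
def error_rate_alert(errors: int, requests: int, threshold: float = 0.05):
    """Return an alert payload when the error rate crosses the threshold, else None."""
    if requests == 0:
        return None
    rate = errors / requests
    if rate <= threshold:
        return None
    return {
        "alert": "HighErrorRate",
        "rate": round(rate, 3),
        # Hypothetical runbook path: the alert carries its own diagnosis steps.
        "runbook": "runbooks/high-error-rate.md",
    }

fired = error_rate_alert(errors=12, requests=100)
quiet = error_rate_alert(errors=1, requests=100)
```

Tying the runbook reference to the alert at development time means the on-call engineer who receives it already knows where the diagnosis steps live.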
Problem: If a deployment can't be reverted, fixing production issues requires finding the developer who wrote the code. Delays pile up while the issue persists.
Approach: Plan database and application changes so they can be rolled back without breaking the system. For schema changes, run new and old structures in parallel and maintain backward compatibility during a transition period.
Benefit: On-call teams revert and restore service quickly. Developers investigate without production pressure.
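The expand/contract pattern described above can be sketched with an in-memory SQLite database. The table, column names, and dual-write helper are illustrative assumptions; the point is that the old and new columns coexist, so either application version runs against the same schema and a rollback needs no schema revert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Expand: add the new column without dropping the old one.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

def insert_user(user_id: int, name: str) -> None:
    # Dual-write during the transition period: both columns stay populated.
    conn.execute(
        "INSERT INTO users (id, name, full_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )

insert_user(1, "Ada Lovelace")

# Old code reads `name`, new code reads `full_name`; both still work.
old_read = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
new_read = conn.execute("SELECT full_name FROM users WHERE id = 1").fetchone()[0]
```

Only after the old application version is fully retired does the contract step remove the legacy column.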
Problem: When each team uses different metrics formats, log structures, and monitoring approaches, on-call engineers waste time understanding each service's setup before they can diagnose anything.
Approach: Establish and enforce standards for tracing, metrics, and logs across all teams. Require compliance when developing or modifying features.
Benefit: Faster issue identification. On-call teams work independently across services.
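A logging standard is easiest to enforce when it ships as shared code rather than a wiki page. Below is a hypothetical shared JSON formatter every service could adopt; the required field names (`service`, `trace_id`) are assumptions for illustration, not a known standard.

```python
import json
import logging

class StandardJsonFormatter(logging.Formatter):
    """Org-wide formatter: every service emits the same queryable fields."""
    REQUIRED_FIELDS = ("service", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.REQUIRED_FIELDS:
            # Missing fields are flagged rather than silently dropped.
            payload[field] = getattr(record, field, "unknown")
        return json.dumps(payload)

formatter = StandardJsonFormatter()
record = logging.LogRecord(
    name="checkout", level=logging.WARNING, pathname="", lineno=0,
    msg="payment retry", args=(), exc_info=None,
)
record.service = "checkout"
record.trace_id = "abc123"
parsed = json.loads(formatter.format(record))
```

Because every service produces the same field names, an on-call engineer can run one query across all services instead of learning each team's log shape.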
Problem: Ambiguous escalation procedures cause delays and miscommunication during incidents.
Approach: Document who to contact and when for each service. Keep contact information, roles, and responsibilities accessible.
Benefit: Faster escalation when needed. No more calling people uninvolved with a given service.
Problem: On-call engineers who don't understand new or updated services will escalate to the developers who built them.
Approach: Developers regularly share knowledge about service architecture, interactions with other services, and database dependencies — through workshops, documentation, or regular meetings.
Benefit: On-call teams handle incidents with new services without pulling in developers.
Problem: On-call engineers escalate because they don't know how to handle specific issues.
Approach: Create detailed runbooks in collaboration with dev teams, covering common issues for each service. Review and update them regularly.
Benefit: On-call engineers resolve issues independently, reducing escalation frequency and response times.
Problem: Staging environments often contain in-progress features, creating differences from production. Features tested in such environments may break in production.
Approach: Maintain a pre-production environment that mirrors production in database structure and ideally data. Test deployments under production-like conditions.
Benefit: Fewer bugs reach production, reducing urgent reverts and fixes.
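Keeping pre-production aligned with production can be partly automated with a schema-drift check. The sketch below compares table layouts of two SQLite databases; the table name and the PRAGMA-based comparison are illustrative assumptions standing in for whatever schema-inspection tooling your database offers.

```python
import sqlite3

def table_columns(conn: sqlite3.Connection, table: str) -> list:
    """Return the ordered column names of a table via SQLite's PRAGMA."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER, total REAL, status TEXT)")

preprod = sqlite3.connect(":memory:")
preprod.execute("CREATE TABLE orders (id INTEGER, total REAL, status TEXT)")

# A deployment gate could fail the pipeline when the layouts diverge.
drift = table_columns(prod, "orders") != table_columns(preprod, "orders")
```

Run as a pipeline step, a check like this catches structural divergence before a deployment is tested against the wrong schema.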
Problem: Large deployments increase failure risk and make troubleshooting harder.
Approach: Reduce the scope of changes per deployment. Deploy more frequently with smaller changesets.
Benefit: When something breaks, there's a smaller set of changes to investigate.
Problem: Developers called in for incidents are often expected to return to normal work immediately. Those called at night or on weekends don't get adequate time off.
Approach: Allocate recovery time after incident response. For daytime incidents, give responders a break before they return to regular work. For nights and weekends, offer a couple of days off.
Benefit: Prevents burnout and turnover. Developers return rested and productive.


In the OpenTelemetry Demo application, 30–45% of the code is observability instrumentation. Plan your development timelines accordingly.

