This is the third article in our series on implementing an observability strategy. Being on-call is undeniably challenging, and there are well-known management practices that ease the life of on-call SREs. Much less attention is paid to practices that improve the experience of development teams, who often bear the burden of frequent incident escalations. In this article, we explore key practices that reduce the number of escalations from the on-call team to the development team, ensuring smoother operations and better overall efficiency.
For more context, you can read our previous articles:
Problem: Developers generally know that observability practices exist, but the area is wide and full of subtleties, so it requires education both on what tools exist and on best practices. Without these skills, services are hard to diagnose, which increases resolution time and often ends in an escalation to the development team to dig into the code.
Approach: Educating teams on implementing observability involves training developers to incorporate comprehensive metrics, logs, and tracing as they build and update features. Allocating proper timelines ensures that observability isn't rushed or tacked on at the end of development but is integrated throughout the development cycle.
Benefits: This practice equips on-call personnel with detailed and actionable data when they need to troubleshoot issues, thereby reducing escalations to development teams.
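As an illustration of building observability in alongside a feature, here is a minimal sketch that emits a metric, a structured log line, and a trace identifier from a single code path. It uses only the Python standard library; the `process_order` function, the `METRICS` counter, and the field names are hypothetical stand-ins for whatever metrics and tracing clients a team actually uses.

```python
import json
import logging
import time
import uuid
from collections import Counter

# Hypothetical in-process metric store; a real service would use its metrics client.
METRICS = Counter()

logger = logging.getLogger("orders")


def process_order(order_id, amount, trace_id=None):
    """Handle one order and emit a metric, a structured log line, and a trace id."""
    trace_id = trace_id or uuid.uuid4().hex  # reuse the caller's id when one is passed
    started = time.monotonic()
    outcome = "error"  # assume failure until the happy path completes
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        outcome = "ok"
        return {"order_id": order_id, "status": "accepted", "trace_id": trace_id}
    finally:
        # Runs on success and on failure, so on-call always gets both signals.
        METRICS[f"orders_processed_total.{outcome}"] += 1
        logger.info(json.dumps({
            "event": "order_processed",
            "order_id": order_id,
            "outcome": outcome,
            "duration_ms": round((time.monotonic() - started) * 1000, 3),
            "trace_id": trace_id,
        }))
```

Because the instrumentation sits in the `finally` block, failed requests are counted and logged with the same fields as successful ones, which is the data an on-call engineer typically needs first.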
Problem: Without dedicated time to understand potential service failures, teams may be unprepared for outages, leading to prolonged downtime and inefficient incident handling.
Approach: During the development phase, allocate dedicated time for teams to thoroughly understand the potential points of failure within a service. This involves pinpointing specific failure scenarios, the types of alerts that could be triggered in such scenarios, and developing a streamlined approach for diagnosing and resolving issues. This phase should include creating or updating runbooks that provide clear instructions on how to handle known problems.
Benefits: By proactively understanding and documenting how to deal with possible failures, on-call teams can respond more quickly and effectively, reducing both downtime and the number of people involved.
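One lightweight way to capture the outcome of this exercise is a small catalog that links each failure scenario to its alert and runbook, plus a check that flags gaps. The sketch below is only an illustration: the scenarios, alert names, and runbook paths are invented for the example.

```python
# Hypothetical catalog produced during the failure-analysis phase.
FAILURE_SCENARIOS = [
    {
        "scenario": "database connection pool exhausted",
        "alert": "db_pool_saturation",
        "runbook": "runbooks/db-pool.md",
    },
    {
        "scenario": "downstream payment API timing out",
        "alert": "payment_api_latency_high",
        "runbook": "runbooks/payment-api.md",
    },
    {
        "scenario": "message queue backlog growing",
        "alert": "queue_depth_high",
        "runbook": None,  # not written yet
    },
]


def missing_runbooks(scenarios):
    """Return the alerts that have no runbook documented yet."""
    return [s["alert"] for s in scenarios if not s.get("runbook")]
```

Running `missing_runbooks(FAILURE_SCENARIOS)` as part of a review or CI step gives the team a concrete list of alerts that would currently force an escalation.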
Problem: If a deployment cannot be reverted, fixing the issue will require contacting the developer who implemented the service. This often results in significant delays and stress as the issue remains unresolved in production.
Approach: Plan and implement database and application changes so that they can be easily reverted without affecting the integrity of the system. For instance, when updating database schemas, run the new structure in parallel with the old one, and maintain backward compatibility for a transitional period.
Benefits: If a bug or similar problem appears, the on-call team can simply revert the service, which keeps a working version in production while the issue is investigated and fixed.
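The parallel-schema idea can be sketched with an in-memory SQLite database. The example assumes a hypothetical `users` table whose `full_name` column is being replaced by `display_name`. During the transitional period both columns exist and both are written, so the previous release can be redeployed at any time.

```python
import sqlite3


def expand(conn):
    """Step 1: add the new column next to the old one; old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")


def write_user(conn, user_id, name):
    """Transitional write path: fill both columns so either release can read."""
    conn.execute(
        "INSERT INTO users (id, full_name, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )


def read_name_old(conn, user_id):
    """Read path of the previous release (the rollback target)."""
    row = conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def read_name_new(conn, user_id):
    """Read path of the new release."""
    row = conn.execute("SELECT display_name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def build_demo_db():
    """Create the old schema, apply the expand step, and write one new row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users (id, full_name) VALUES (1, 'Ada')")
    expand(conn)
    write_user(conn, 2, "Grace")
    return conn
```

The old column would be dropped only in a later, separate release, once reverting to the previous version is no longer needed.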
Problem: Inconsistent observability practices across teams lead to gaps in monitoring and slow down diagnosis: if a service's metrics or log format are unique to it, the on-call engineer first has to learn them before finding the cause.
Approach: Establishing and enforcing standards for tracing, metrics, and logs across all teams ensures consistency in the monitoring practices used. This includes setting up comprehensive guidelines covering these aspects that must be followed when new features are developed or existing features are modified.
Benefits: Shared standards shorten the time needed to find the cause of an issue and let on-call teams work independently, without calling in additional people.
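A standard is easiest to follow when it ships as shared code. The following sketch shows one possible shape for a common log formatter built on Python's `logging` module. The field set (`timestamp`, `level`, `service`, `trace_id`, `message`) is an assumed example of a team-wide convention, not a prescribed one.

```python
import json
import logging
import time


class StandardJsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same field names."""

    converter = time.gmtime  # timestamps in UTC for every service

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # trace_id is optional; pass it via logger.info(..., extra={"trace_id": ...})
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def get_logger(service):
    """Return a logger for the service that writes in the shared format."""
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(StandardJsonFormatter(service))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A service would then call something like `get_logger("billing").error("payment failed", extra={"trace_id": "abc"})`, and every team's logs could be searched with the same queries.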
Problem: Ambiguity in escalation procedures can lead to delays and miscommunications during critical incidents, exacerbating the situation.
Approach: Clearly defining who should be contacted and when during an incident ensures that there is no confusion during high-stress periods. This includes having contact information, roles, and responsibilities clearly documented and easily accessible.
Benefits: Increases the efficiency of incident handling by reducing the time spent figuring out who to contact, thus speeding up the escalation process when necessary. It also helps to avoid calling people who are not involved in the development of a specific service.
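An escalation policy can also be kept as data next to the service, so that "who to contact and when" is a lookup rather than a guess. This is a minimal sketch: the service name, roles, channels, and time thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    channel: str


# Hypothetical policy: (minutes the incident has been unresolved, who joins at that point).
ESCALATION_POLICY = {
    "payments": [
        (0, Contact("on-call SRE", "first responder", "pager")),
        (15, Contact("payments developer on duty", "service owner", "phone")),
        (30, Contact("payments engineering manager", "incident lead", "phone")),
    ],
}


def who_to_contact(service, minutes_unresolved):
    """Return everyone who should be involved by now for the given service."""
    steps = ESCALATION_POLICY.get(service)
    if steps is None:
        raise KeyError(f"no escalation policy documented for {service!r}")
    return [contact for threshold, contact in steps if minutes_unresolved >= threshold]
```

Raising an error for an undocumented service is a deliberate choice in this sketch: a missing policy is surfaced immediately instead of being discovered in the middle of an incident.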
Problem: When on-call team members do not understand new or updated services, incident management becomes confused and error-prone, and they will almost certainly need help from the people who developed the system.
Approach: Developers should regularly share knowledge about the architecture of new services, as well as any interactions with other services and databases. This can be facilitated through workshops, documentation, or regular meetings.
Benefits: Improves the on-call team's knowledge and preparedness for handling incidents with new services, making issue resolution faster and more accurate. This will also allow the on-call team to work on their own without involving others.
Problem: Frequent escalations can occur due to on-call engineers not knowing immediately how to handle specific issues with services.
Approach: Develop detailed runbooks in collaboration with the development team, outlining clear steps for addressing common issues that might arise with each service. These runbooks should be regularly reviewed and updated.
Benefits: Well-crafted runbooks enable on-call engineers to resolve issues more independently and reduce the frequency of escalations, thereby decreasing stress and response times.
Problem: Often, staging environments contain features still under development, which can lead to inconsistencies between the staging and production environments. If a feature tested in such a staging environment is pushed to production, it may not perform as expected due to differences in the codebase.
Approach: Establish a pre-production environment that mirrors the production environment in terms of database structure and, ideally, data. This setup should closely replicate the production environment to ensure that any deployments have been thoroughly tested under conditions that match production.
Benefits: This practice minimizes the risk of deploying features with bugs, thereby reducing the urgency to either revert or fix issues under pressure, maintaining stability in production environments.
Problem: Frequent and large-scale deployments can overwhelm systems and teams, leading to increased risk of failures and on-call stress.
Approach: By controlling the frequency of deployments and the scope of changes in each deployment, organizations can reduce the risk of introducing errors and the operational load on the on-call team.
Benefits: Fewer changes per deploy mean a lower chance of something going wrong and a smaller set of changes to troubleshoot in the event of an issue, which simplifies incident management.
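Limits on deployment size and frequency can be encoded as a simple pre-deploy check. The sketch below assumes two hypothetical thresholds and a list of change identifiers supplied by the release tooling. The actual numbers would be chosen by each organization.

```python
MAX_CHANGES_PER_DEPLOY = 5      # hypothetical limit on the scope of one deployment
MIN_HOURS_BETWEEN_DEPLOYS = 4   # hypothetical limit on deployment frequency


def deploy_allowed(change_ids, hours_since_last_deploy):
    """Return (allowed, reasons); reasons is empty when the deploy may proceed."""
    reasons = []
    if len(change_ids) > MAX_CHANGES_PER_DEPLOY:
        reasons.append(
            f"{len(change_ids)} changes exceed the limit of {MAX_CHANGES_PER_DEPLOY}"
        )
    if hours_since_last_deploy < MIN_HOURS_BETWEEN_DEPLOYS:
        reasons.append(
            f"last deploy was {hours_since_last_deploy}h ago, "
            f"minimum gap is {MIN_HOURS_BETWEEN_DEPLOYS}h"
        )
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict lets the release tooling explain why a deployment was blocked instead of failing silently.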
Problem: A common issue is that developers called in to assist with incidents during the day are expected to return to their regular work immediately after, despite the stress involved, and those called during nights or weekends do not receive adequate time off afterwards.
Approach: Stress management for on-call teams is widely understood, but developers are often overlooked, because the burden of even occasional calls during off-hours, and especially at night, is easy to underestimate. Even infrequent incidents place a high burden on the people involved, and this should be managed. Ensure that developers who are called upon to address critical issues have allocated time to recover following their intervention. If the incident occurs during normal working hours, provide some downtime afterwards. If it occurs during nights or weekends, allow a couple of days off as compensation.
Benefits: Granting recovery time not only aids in preserving their mental health and productivity but also underscores the importance of addressing the stress they experience. This practice ensures developers are well-rested, more engaged in their tasks, and helps prevent burnout and turnover.
Implementing these practices not only helps in reducing the number of escalations to the development team but also ensures a more efficient and less stressful working environment for everyone involved. By proactively addressing potential issues and establishing clear guidelines, we can create more resilient and responsive systems.
These strategies also enhance productivity, so that both teams can perform at their best and deliver high-quality services consistently.