Building an Effective Observability Strategy: A Comprehensive Checklist - Part 1

Evgeny Potapov

This article is the first part of our blog series on effective observability strategies. Be sure to read the second part here.

I should start this article by stating the obvious: In today's fast-paced digital world, the efficiency and reliability of our technological infrastructures have become paramount. As we weave through the complex web of software and systems that support our daily operations, the role of monitoring these systems escalates from a mere task to a critical necessity. However, there's a gaping chasm between simply monitoring and monitoring with a purpose.

While it's widely recognized among engineers that monitoring application functionality - with a focus on the user experience - is paramount, achieving comprehensive coverage that mirrors the user experience can be daunting. Often, observability efforts start strong, focusing on crucial user-oriented functionalities. Yet, the journey to deep observability demands more; it requires a thorough exploration beyond these surface-level checks. This isn't just about technical diligence but about embracing observability as a fundamental part of understanding and improving the user journey.

The truth is, many engineering teams start with the best intentions, aiming to monitor their applications comprehensively. However, they might find themselves constrained by immediate priorities or limited by the initial scope of their monitoring tools. This isn't a failure on their part but a reflection of the challenges inherent in balancing development speed with thorough, user-centric observability.

Deepening observability requires a shift towards system analysis, an in-depth understanding of user flows, and a collaborative effort between project management and engineering. In practice, this means not just ticking off boxes for monitoring basic functionalities but engaging in continuous dialogue about what the users are experiencing. Observability is now a software engineering process. It involves engineers, product managers, and observability specialists coming together to map out the user journey in its entirety and to identify the touchpoints where observability can offer the most significant insights and impact.

In this series of articles, I aim to explore the "ideal" vision of how an observability ecosystem should be implemented for a product. Given the constant changes and the addition of new functionalities inherent in the agile nature of modern product development, achieving this ideal scenario might not be entirely feasible. However, by envisioning an ideal strategy, we can compare it to our current state and make necessary adaptations. I believe that the foundation of implementing such an ecosystem begins with a clear understanding of its purpose and objectives. Therefore, I will start with a checklist of what I believe should be included in this ecosystem, which tools could facilitate these goals, and the rationale behind each implementation area.

Two top-level areas that we most want to monitor are:

Application Health and User Experience Monitoring

Ensuring the application works for users involves checks that replicate browser actions performed by users to verify results. This includes verifying the ability of users to log in, ensuring typical user journeys are functioning correctly, and specifically verifying the completion of essential operations within the application.
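As a rough illustration of such a check, the sketch below runs a user journey as an ordered list of named steps and reports the first step that fails. The step functions here (`open_login_page`, `submit_credentials`, `complete_checkout`) are hypothetical stand-ins; in a real setup each one would drive a browser automation tool and assert on what the user would actually see.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class JourneyResult:
    journey: str
    passed: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_journey(name: str, steps: List[Tuple[str, Callable[[], None]]]) -> JourneyResult:
    """Run the steps in order and stop at the first one that raises."""
    for step_name, action in steps:
        try:
            action()
        except Exception as exc:
            return JourneyResult(name, False, step_name, str(exc))
    return JourneyResult(name, True)


# Hypothetical stand-ins for real browser actions (assumptions for this sketch).
def open_login_page() -> None:
    pass


def submit_credentials() -> None:
    pass


def complete_checkout() -> None:
    # Simulates an essential operation that silently stopped working.
    raise AssertionError("order confirmation not shown")


result = run_journey("login-and-checkout", [
    ("open login page", open_login_page),
    ("submit credentials", submit_credentials),
    ("complete checkout", complete_checkout),
])
print(result)
```

Reporting the failing step by name, rather than a single pass/fail flag, tells whoever is on call which part of the journey to look at first.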

Business Impact Analysis

Information about product issues sometimes originates from the business team, indicated by changes such as conversion drops or revenue decreases. Recognizing these issues promptly allows for a more measured, less stressful response, involving both the product and tech teams in the investigation.
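One possible way to surface such a signal to engineers early is to compare the current conversion rate against a baseline and flag a large relative drop. The following is only a sketch: the function names, the sample numbers, and the 20% threshold are assumptions chosen for illustration, and a real baseline would normally account for seasonality and traffic mix.

```python
def conversion_rate(orders: int, sessions: int) -> float:
    """Fraction of sessions that ended in an order (0.0 when there is no traffic)."""
    return orders / sessions if sessions else 0.0


def conversion_dropped(baseline_rate: float,
                       current_rate: float,
                       max_relative_drop: float = 0.2) -> bool:
    """True when the current rate fell by more than max_relative_drop vs. the baseline."""
    if baseline_rate <= 0:
        return False  # no meaningful baseline to compare against
    relative_drop = (baseline_rate - current_rate) / baseline_rate
    return relative_drop > max_relative_drop


# Illustrative numbers only: 3% baseline conversion vs. 1.8% in the current window.
baseline = conversion_rate(orders=300, sessions=10_000)
current = conversion_rate(orders=180, sessions=10_000)
alert = conversion_dropped(baseline, current)
print(alert)  # True: a 40% relative drop exceeds the 20% threshold
```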

Going one level deeper into the ecosystem, we first need to determine whether a problem originates in the frontend or in the frontend-facing backend APIs.

Frontend or Backend Issue Localization

If the issue is located in the frontend, either the deployment team reverts to a previous working version or disables the feature with a feature flag, or the frontend team provides a fix.

If something is happening to a frontend-facing backend API, the next step, in an ideal world, is to pinpoint the underlying issue. If the whole application is covered with proper error logging and tracing, those systems will identify where the issue is located and make it easier to work on a fix. In many situations, however, there is not enough information; we'll cover that in the next steps.

Backend Issue Tracing

In the majority of cases, however, that level of logging and tracing coverage is not in place. Product development, especially for a profitable product in an agile environment, moves at a fast pace, and it's really easy for development teams to overlook corner cases. There might be no tracing implemented, and a microservice, instead of properly reporting an error, can return HTTP 200 with no useful information. In such cases, instead of relying solely on tracing, we need to be able to track the issue with service-specific monitoring.

Service Failure Detection
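To make the "HTTP 200 with no information" case concrete, here is a minimal sketch of a service-specific check that looks at the response body and not only the status code. The function name, the result labels, and the rule that an empty required field counts as a failure are all assumptions for illustration; what counts as a valid response depends on the service's actual contract.

```python
import json
from typing import Iterable


def classify_response(status: int, body: str, required_fields: Iterable[str]) -> str:
    """Classify a response as 'ok', 'http_error', or 'silent_failure'."""
    if not 200 <= status < 300:
        return "http_error"
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # A 2xx response with an empty or non-JSON body carries no usable data.
        return "silent_failure"
    if not isinstance(payload, dict):
        return "silent_failure"
    # Assumption: a required field that is absent or empty means the call failed.
    missing = [f for f in required_fields if payload.get(f) in (None, "", [], {})]
    return "silent_failure" if missing else "ok"


print(classify_response(200, "", ["items"]))                # silent_failure
print(classify_response(200, '{"items": []}', ["items"]))   # silent_failure
print(classify_response(200, '{"items": [1]}', ["items"]))  # ok
print(classify_response(503, "", ["items"]))                # http_error
```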

Once we've identified the service in question, obtaining detailed information about the potential cause of an outage or performance degradation is crucial. The issue might stem from degradation of the service's underlying infrastructure (which we'll cover in the next step), but it is often related to interactions with other components: other microservices within the application (as addressed by Service Failure Detection), the service's connections to databases, caches, and other storage systems, and its connections to external services' APIs. At this stage, profiling the service to pinpoint the exact failing functionality becomes essential.

Service Profiling and Dependency Analysis
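A lightweight way to see which dependency is slow or failing is to time every outbound call and record whether it succeeded. The sketch below assumes a hypothetical `DependencyTimer` helper and made-up dependency names ("postgres", "payments-api"); a production service would more likely export these measurements to its metrics or tracing backend than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class DependencyTimer:
    """Collects duration and success/failure for calls to each dependency."""

    def __init__(self):
        self.calls = defaultdict(list)  # dependency name -> [(seconds, ok), ...]

    @contextmanager
    def track(self, dependency: str):
        start = time.perf_counter()
        ok = True
        try:
            yield
        except Exception:
            ok = False
            raise  # record the failure but let the caller handle it
        finally:
            self.calls[dependency].append((time.perf_counter() - start, ok))

    def summary(self) -> dict:
        report = {}
        for dependency, samples in self.calls.items():
            report[dependency] = {
                "calls": len(samples),
                "errors": sum(1 for _, ok in samples if not ok),
                "max_seconds": max(seconds for seconds, _ in samples),
            }
        return report


timer = DependencyTimer()

with timer.track("postgres"):
    time.sleep(0.01)  # stands in for a database query

try:
    with timer.track("payments-api"):
        raise TimeoutError("upstream timed out")  # stands in for a failed external call
except TimeoutError:
    pass

report = timer.summary()
print(report)
```

Grouping the numbers by dependency makes it easier to tell a slow database apart from a failing external API when the service as a whole looks degraded.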

Now, and only now, we're delving into an area that, in many observability ecosystems, is often the only one comprehensively covered: server/cloud infrastructure, which serves as the foundation for the entire product. While the majority of previous scenarios (application-level issues, problems with services' interconnection, frontend issues) may not show any outliers in infrastructure-level metrics - often, in fact, showing reduced load - many application issues do stem from infrastructure performance degradation.

Infrastructure Monitoring

These steps, in my view, construct an ideal observability ecosystem that I'd like to see. "Ideal" here acknowledges the myriad reasons why this vision might not be easily achieved, yet it serves as a guiding "north star" that a company may aspire to reach. Nonetheless, there's an additional crucial step. Despite all efforts, there are instances when application issues slip through unnoticed. In such moments, our users reach out to report performance issues, failures, and other problems. While the support team will manage these interactions, having engineers aware of a spike in such reports from the outset allows them to start investigating before user dissatisfaction spreads.

User Feedback Integration
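As one illustration of how a spike in user reports could reach engineers, the sketch below counts reports inside a sliding time window and signals when the count reaches a threshold. The class name, window size, and threshold are assumptions, and the sketch expects timestamps (in seconds) to arrive in non-decreasing order.

```python
from collections import deque


class ReportSpikeDetector:
    """Flags a spike when enough user reports arrive within a sliding window."""

    def __init__(self, window_seconds: float = 600, threshold: int = 5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Register one report; return True if the window now holds a spike."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop reports that have fallen out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


detector = ReportSpikeDetector(window_seconds=600, threshold=3)
alerts = [detector.record(t) for t in (0, 1000, 1100, 1200)]
print(alerts)  # [False, False, False, True]
```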

In this introductory article, I've outlined the foundational philosophy behind a purpose-oriented observability ecosystem, emphasizing the critical role it plays in aligning monitoring efforts with user experience and application health.

Further, I will delve into the technical specifics of implementing each discussed step, providing practical examples and case studies. These insights, I hope, will serve as a valuable reference for those looking to refine their observability strategies, ensuring they are equipped with the knowledge to build a system that not only monitors but truly understands and improves the user journey.

At ApexData, our mission is to empower companies to embrace this purpose-oriented approach to observability. We believe in creating a system that simplifies the complex, making it easier for teams to implement a robust observability framework that aligns with their operational goals and user expectations.

As we continue to explore the depths of effective observability practices, we invite you to join us on this journey. To get firsthand experience with our cutting-edge solutions and to contribute to the evolution of observability, we encourage you to sign up for our private beta. Join us in shaping the future of observability, where every insight is driven by purpose, and every monitoring effort enhances the user experience.
