This article is the first part of our blog series on effective observability strategies.
I should start this article by stating the obvious: In today's fast-paced digital world, the efficiency and reliability of our technological infrastructures have become paramount. As we weave through the complex web of software and systems that support our daily operations, the role of monitoring these systems escalates from a mere task to a critical necessity. However, there's a gaping chasm between simply monitoring and monitoring with a purpose.
While it's widely recognized among engineers that monitoring application functionality - with a focus on the user experience - is paramount, achieving comprehensive coverage that mirrors the user experience can be daunting. Often, observability efforts start strong, focusing on crucial user-oriented functionalities. Yet, the journey to deep observability demands more; it requires a thorough exploration beyond these surface-level checks. This isn't just about technical diligence but about embracing observability as a fundamental part of understanding and improving the user journey.
The truth is, many engineering teams start with the best intentions, aiming to monitor their applications comprehensively. However, they might find themselves constrained by immediate priorities or limited by the initial scope of their monitoring tools. This isn't a failure on their part but a reflection of the challenges inherent in balancing development speed with thorough, user-centric observability.
Deepening observability requires a shift towards system analysis, an in-depth understanding of user flows, and a collaborative effort between project management and engineering. In practice, this means not just ticking off boxes for monitoring basic functionalities but engaging in continuous dialogue about what the users are experiencing. Observability is now a software engineering process. It involves engineers, product managers, and observability specialists coming together to map out the user journey in its entirety and identifying the touchpoints where observability can offer the most significant insights and impact.
In this series of articles, I aim to explore the "ideal" vision of how an observability ecosystem should be implemented for a product. Given the constant changes and the addition of new functionalities inherent in the agile nature of modern product development, achieving this ideal scenario might not be entirely feasible. However, by envisioning an ideal strategy, we can compare it to our current state and make necessary adaptations. I believe that the foundation of implementing such an ecosystem begins with a clear understanding of its purpose and objectives. Therefore, I will start with a checklist of what I believe should be included in this ecosystem, which tools could facilitate these goals, and the rationale behind each implementation area.
Two critical top-level areas that we want monitored are:
Application Health and User Experience Monitoring
Ensuring the application works for users involves checks that replicate browser actions performed by users to verify results. This includes verifying the ability of users to log in, ensuring typical user journeys are functioning correctly, and specifically verifying the completion of essential operations within the application.
Objective: Ensure the application's main functionality works as expected for users.
Who Provides Metrics: Product management team defines critical user journeys and functionalities.
Monitoring Tools: Synthetic monitoring (browser checks), Real User Monitoring (RUM).
What to Monitor: Availability, functionality, and performance of key user journeys and features.
Coverage Checklist:
Verify user ability to log in and navigate critical paths.
Check load times for key pages and actions.
Monitor for JavaScript errors on critical user journey paths.
Verify critical functionalities, like adding a friend on a social network, are working as intended.
Alerts Checklist:
Set alerts for login failures or significant increases in login time.
Alert on increases in page load times beyond a predefined threshold.
Monitor and alert for an increase in JavaScript errors on critical paths.
Set up alerts for failures or significant performance issues in critical functionalities, such as errors during or failure to complete specific actions like adding a friend.
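The alerting logic above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `CheckResult` shape, the journey names, and the thresholds are assumptions made for the example; real values would come from your synthetic monitoring tool and from SLOs agreed with the product team.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    journey: str          # e.g. "login", "add_friend" (illustrative names)
    succeeded: bool
    duration_ms: float
    js_errors: int

def evaluate_checks(results, max_duration_ms=3000, max_js_errors=0):
    """Turn raw synthetic-check results into a list of alert messages.

    Thresholds are illustrative; in practice they come from agreed SLOs.
    """
    alerts = []
    for r in results:
        if not r.succeeded:
            alerts.append(f"ALERT: journey '{r.journey}' failed")
        if r.duration_ms > max_duration_ms:
            alerts.append(
                f"ALERT: journey '{r.journey}' took {r.duration_ms:.0f} ms "
                f"(threshold {max_duration_ms} ms)")
        if r.js_errors > max_js_errors:
            alerts.append(
                f"ALERT: {r.js_errors} JavaScript error(s) on '{r.journey}'")
    return alerts
```

The point of the sketch is the shape of the mapping, from per-journey check results to user-facing alert conditions, not the specific thresholds.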
Business Impact Analysis
Information about product issues sometimes originates from the business team, indicated by changes such as conversion drops or revenue decreases. Recognizing these issues promptly allows for a more measured, less stressful response, involving both the product and tech teams in the investigation.
Objective: Ensure that critical business processes are within the accepted range, even if crucial application monitoring shows no errors.
Who Provides Metrics: Business analysts in collaboration with the product management team.
Monitoring Tools: Business intelligence (BI) tools, APM integrated with business analytics, custom dashboards.
What to Monitor: Changes in business metrics, patterns that indicate a technical issue is affecting user behavior or business outcomes.
Coverage Checklist:
Regular checks on conversion rates and other key performance indicators (KPIs).
Alerts for sudden changes in revenue or other critical business metrics.
Analysis of traffic sources and user engagement metrics for anomalies.
Coordination between technical and business teams to correlate tech issues with business metrics changes.
Alerts Checklist:
Set alerts for significant drops or changes in conversion rates.
Monitor and alert on unexpected shifts in revenue or other critical business metrics.
Alert on anomalies in traffic sources and user engagement metrics that deviate from historical patterns.
Implement cross-functional alerts to notify both technical and business teams of potential issues detected through monitoring.
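One simple way to flag a "sudden change" in a business metric is to compare the current value against its historical distribution. The sketch below uses a plain z-score, which is an assumption made for brevity; real business metrics usually need seasonality-aware baselines (hour-of-day, day-of-week) rather than a single mean and standard deviation.

```python
import statistics

def detect_metric_anomaly(history, current, z_threshold=3.0):
    """Flag a business metric (e.g. hourly conversion rate) whose current
    value deviates from its historical distribution by more than
    `z_threshold` standard deviations.

    A plain z-score is an illustrative simplification; production systems
    typically use seasonality-aware models instead.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```

The same function works for revenue, sign-ups, or any KPI for which the business team can supply a history of "normal" values.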
Transitioning into a deeper dive within our ecosystem, it's essential to first discern whether challenges arise from the frontend or the frontend-facing backend APIs.
Frontend or Backend Issue Localization
Objective: Determine whether issues are occurring in the frontend or frontend-facing backend API.
Who Provides Metrics: Frontend and backend development teams.
Monitoring Tools: Synthetic monitoring (browser checks), Real User Monitoring (RUM), and uptime/performance checks against frontend-facing APIs; consider employing tools such as AlertSite, Runscope, Pingdom, or similar.
Coverage Checklist:
Ensure all Frontend-Facing APIs have uptime and performance monitoring.
Set up browser error logging to capture JavaScript and resource loading errors.
Implement frontend performance monitoring to track page load times and user interactions.
Alerts Checklist:
Alert on downtime or performance degradation of Frontend-Facing APIs.
Set up alerts for critical browser-level errors or a spike in error rates.
Monitor and alert on significant changes in frontend performance metrics, like increased page load times.
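The first-pass triage described above can be reduced to a small decision function. This is deliberately simplistic (real incidents rarely split this cleanly), and the two boolean inputs are assumed to come from your frontend-facing API uptime checks and your synthetic browser checks respectively:

```python
def localize_issue(api_checks_ok: bool, browser_checks_ok: bool) -> str:
    """Rough first-pass triage of where an incident originates, based on
    two signals: uptime checks against frontend-facing APIs and synthetic
    browser checks. Real incidents are messier; this encodes only the
    initial heuristic.
    """
    if not api_checks_ok:
        return "backend"    # the API itself is failing; browser errors follow
    if not browser_checks_ok:
        return "frontend"   # APIs healthy, but the browser flow breaks
    return "healthy"
```

In practice this heuristic is the routing step: it decides whether the frontend team or the backend on-call picks up the investigation first.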
If the issue is located in the frontend, the deployment team can revert to a previous working version or disable the feature with a feature flag, or the frontend team can provide a fix.
If something is wrong with a frontend-facing backend API, then in an ideal world the next step is the final localization of the underlying issue. If the whole application is covered with proper error logging and tracing, those systems will identify the issue's location and facilitate work on the fix. However, in many situations there is not enough information; we'll cover this in the next steps.
Backend Issue Tracing
Objective: Identify the exact location and cause of backend errors, providing a trace to the issue's origin when the entire service stack is monitored.
Who Provides Metrics: Backend development and operations teams.
Monitoring Tools: Distributed tracing tools (such as Jaeger, Zipkin), log aggregation tools (such as ELK stack, Splunk), and Application Performance Monitoring (APM) tools.
What to Monitor:
Endpoint performance and error rates to identify failing backend services.
Detailed error logs and exception tracking to understand the nature of backend issues.
Distributed traces of requests to follow the path of failure through the service stack.
Coverage Checklist:
Ensure all backend services and endpoints are included in distributed tracing.
Set up detailed logging for error and exception tracking in all backend services.
Monitor backend services' performance metrics for signs of degradation or failures.
Alerts Checklist:
Alert on critical errors or exceptions in backend services that could indicate failures.
Set thresholds for performance degradation alerts to catch issues before they impact users.
Use anomaly detection on traces and logs to automatically alert on unusual patterns indicating hidden issues.
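When tracing is in place, finding the origin of a failure often amounts to walking the trace to the deepest failing span. A rough sketch follows; the span dictionary shape (`span_id`, `parent_id`, `service`, `error`) is an assumption loosely modeled on what tracing backends such as Jaeger export, not any tool's actual API:

```python
def root_cause_span(spans):
    """Given the spans of one distributed trace, return the deepest span
    that recorded an error, a rough proxy for the service where the
    failure originated (errors usually propagate upward from there).
    """
    failing = [s for s in spans if s["error"]]
    if not failing:
        return None
    by_id = {s["span_id"]: s for s in spans}

    def depth(span):
        d = 0
        while span["parent_id"] is not None:
            span = by_id[span["parent_id"]]
            d += 1
        return d

    return max(failing, key=depth)
```

Real trace analysis must also handle partial traces, async fan-out, and errors that do not propagate to the root, but the "deepest failing span" heuristic is a common starting point.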
However, in the majority of cases, it's not like that. Product development, especially for a profitable product in an agile environment, moves at a fast pace, and it's easy for development teams to overlook corner cases. There might be no tracing implemented, and a microservice, instead of raising a proper exception, can simply return no information along with HTTP 200. In such cases, instead of relying solely on tracing, we need to be able to track the issue with service-specific monitoring.
Service Failure Detection
Objective: Pinpoint the specific place in the application where the issue originated.
Who Provides Metrics: Backend development and operations teams.
Monitoring Tools: Utilize tools that enable executing custom requests against backend microservice APIs for monitoring, such as Postman for manual testing and automated scripts, Prometheus for collecting metrics and alerting based on those metrics, and Grafana for creating dashboards that visualize the data.
What to Monitor:
Error rates and types by service.
Service response times and outliers.
Coverage Checklist:
Implement detailed logging for all backend services, capturing both standard operations and error conditions.
Include health checks for all services, ensuring that each service can report its health status at any given time.
Alerts Checklist:
Set up alerts for any increase in error rates across services, distinguishing between critical errors and warnings.
Alert on any service response times that exceed predefined thresholds, indicating potential performance issues.
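The HTTP-200-with-empty-body failure mode mentioned earlier is exactly what a content-aware health check catches: validate the response body, not just the status code. A sketch, assuming (purely for illustration) that the service's contract is a JSON object containing a `status` field:

```python
import json

def check_service_response(status_code: int, body: str) -> list:
    """Validate a microservice response beyond its HTTP status.

    Catches the failure mode where a service returns HTTP 200 but an
    empty or malformed payload. The expected contract (a JSON object
    with a 'status' field) is an illustrative assumption.
    """
    if status_code != 200:
        return [f"unexpected status code {status_code}"]
    if not body.strip():
        return ["HTTP 200 with empty body"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["HTTP 200 with non-JSON body"]
    if "status" not in payload:
        return ["payload missing 'status' field"]
    return []
```

A monitoring script (scheduled via Prometheus exporters, Postman collections, or a cron job) would run checks like this against each service's API and emit the problem list as alertable metrics.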
Once we've identified the service in question, obtaining detailed information about the potential cause of an outage or performance degradation is crucial. This issue might stem from the degradation of the underlying infrastructure of the service (which we'll cover in the next step), or it's often related to interactions with other components such as other microservices within the application (as addressed by Service Failure Detection), the service's connections to databases, caches, and other storages, and its connections to external services' APIs in the outer world. At this stage, profiling the service to pinpoint the exact failing functionality becomes essential.
Service Profiling and Dependency Analysis
Objective: Acquire detailed profiling metrics of the service to identify performance degradation and log stuck or interrupted connections to other services.
Who Provides Metrics: Development team.
Monitoring Tools: Application Performance Management (APM) tools, network performance monitoring tools, and service mesh observability tools.
What to Monitor: Service's execution time for various functions, error rates, and failure points in interactions with dependent services.
Coverage Checklist:
Profile and monitor key service functionalities to detect performance bottlenecks.
Track service interaction times with databases, caches, and other internal storage systems to identify delays.
Monitor connectivity and response times to external APIs, highlighting timeout issues or abnormal latencies.
Utilize service mesh tools to visualize and monitor the flow of requests through microservices architecture, identifying failing services.
Alerts Checklist:
Set up alerts for significant deviations in service execution times from their baseline values.
Alert on error rates exceeding predefined thresholds in service functionalities or in interactions with dependencies.
Implement alerts for timeouts or significant latencies in connections to databases, caches, external APIs, and other microservices.
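Service profiling can start as simply as recording per-function execution times and comparing the latest call against an agreed baseline. A minimal in-process sketch follows; a real service would export these samples to an APM tool rather than keep them in memory, and the baseline and factor are illustrative assumptions:

```python
import time
from collections import defaultdict

# Per-function timing samples; in production these would be exported
# to an APM backend instead of accumulating in process memory.
_timings = defaultdict(list)

def profiled(fn):
    """Decorator that records the wall-clock duration of each call."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            _timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

def exceeds_baseline(name, baseline_s, factor=2.0):
    """True if the latest recorded call took more than `factor` times
    the agreed baseline duration for that function."""
    samples = _timings[name]
    return bool(samples) and samples[-1] > baseline_s * factor
```

Wrapping the functions that talk to databases, caches, and external APIs with such a decorator gives exactly the per-dependency timing signals the checklist above calls for.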
Now, and only now, we're delving into an area that, in many observability ecosystems, is often the only one comprehensively covered: the server/cloud infrastructure that serves as the foundation for the entire product. While the majority of the previous scenarios (application-level issues, problems with services' interconnection, frontend issues) may not show any outliers in infrastructure-level metrics - often, in fact, showing reduced load - many application issues do stem from infrastructure performance degradation.
Infrastructure Monitoring
Objective: Identify and distinguish between issues originating from the application layer and those stemming from the underlying infrastructure.
Who Provides Metrics: Infrastructure and operations teams.
Monitoring Tools: Infrastructure monitoring tools such as Prometheus, Grafana, Datadog, alongside cloud provider's native monitoring tools.
What to Monitor:
Container orchestration metrics (e.g., Kubernetes cluster health, pod statuses).
Server metrics, including CPU usage, memory utilization, disk I/O operations.
Network metrics, such as bandwidth usage, latency, packet loss.
Health and availability of cloud services and other critical infrastructure components.
Coverage Checklist:
Implement continuous monitoring of all physical and virtual servers' vital metrics.
Monitor the health and performance of container orchestration systems to ensure optimal deployment and scaling of applications.
Keep a close watch on network throughput and errors to identify potential bottlenecks or failures affecting application connectivity.
Ensure cloud services and infrastructure components are monitored for availability and performance issues.
Alerts Checklist:
Set up threshold-based alerts for critical server metrics (CPU, memory, disk I/O) to quickly identify when resources are overstressed.
Configure alerts for container orchestration health issues, including failed deployments or unhealthy pods.
Establish network performance alerts to notify teams of potential connectivity issues or degradation in network quality.
Implement alerts for cloud service downtimes or performance degradations that could impact application availability or performance.
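A detail worth encoding in threshold-based infrastructure alerts is duration: page only when a metric stays above its threshold for several consecutive samples, which is the same idea as the `for:` clause in a Prometheus alerting rule and avoids waking people for brief spikes. A sketch with illustrative thresholds:

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Fire only when a metric exceeds `threshold` for at least
    `min_consecutive` consecutive samples, suppressing one-off spikes.
    Threshold and window values here are illustrative, not recommended
    defaults.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Applied to CPU, memory, or disk I/O series scraped by Prometheus (or a cloud provider's native monitoring), this is the difference between a noisy alert and an actionable one.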
These steps, in my view, construct the ideal observability ecosystem I'd like to see. "Ideal" here acknowledges the myriad reasons why this vision might not be easily achieved, yet it serves as a guiding north star that a company may aspire to reach. Nonetheless, there's an additional crucial step. Despite all efforts, there are instances when application issues slip through unnoticed. In such moments, our users reach out to report performance problems, failures, and other issues. While the support team will manage these interactions, making engineers aware of a spike in such reports from the onset allows investigations to start before user dissatisfaction proliferates.
User Feedback Integration
Objective: Leverage user feedback to identify and prioritize issues not detected by automated monitoring systems.
Who Provides Metrics: Customer support and product management teams.
Monitoring Tools: Customer feedback platforms, support ticket systems integrated with monitoring tools, sentiment analysis tools for social media and forums.
What to Monitor:
Volume and nature of user-reported issues.
Sentiment analysis results from user feedback across various channels.
Correlation between user feedback and metrics captured by monitoring systems.
Coverage Checklist:
Implement a system for categorizing and quantifying user-reported issues for trend analysis.
Integrate user feedback from multiple sources, including support tickets, social media, and direct user feedback, into a centralized dashboard for easy monitoring.
Use sentiment analysis tools to gauge user sentiment across different platforms, identifying potential issues from user discussions.
Establish processes for correlating spikes in user-reported issues with data from monitoring tools to identify potential root causes.
Alerts Checklist:
Set up real-time alerts for abnormal increases in user-reported issues, enabling swift response to emerging problems.
Configure alerts for significant shifts in sentiment analysis metrics, indicating changing user perceptions that might not yet be reflected in support tickets.
Implement cross-functional alerts to ensure that spikes in user feedback are communicated to both technical and customer support teams, fostering a collaborative approach to problem resolution.
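Detecting an "abnormal increase" in user-reported issues can start with a simple ratio against a trailing baseline. The window size and ratio below are illustrative defaults, not recommendations; a support platform integration would feed the daily (or hourly) ticket counts in:

```python
def ticket_spike(daily_counts, window=7, ratio=2.0):
    """Flag a spike in user-reported issues: the latest day's ticket
    count exceeds `ratio` times the average of the preceding `window`
    days. Window and ratio are illustrative defaults.
    """
    if len(daily_counts) < window + 1:
        return False  # not enough history to judge
    *history, latest = daily_counts[-(window + 1):]
    baseline = sum(history) / window
    return latest > baseline * ratio
```

When this fires, the cross-functional alert described above should reach both the on-call engineer and the support lead, so the correlation with monitoring data starts immediately.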
In this introductory article, I've outlined the foundational philosophy behind a purpose-oriented observability ecosystem, emphasizing the critical role it plays in aligning monitoring efforts with user experience and application health.
In subsequent articles, I will delve into the technical specifics of implementing each step discussed here, providing practical examples and case studies. These insights, I hope, will serve as a valuable reference for those looking to refine their observability strategies, ensuring they are equipped with the knowledge to build a system that not only monitors but truly understands and improves the user journey.
At ApexData, our mission is to empower companies to embrace this purpose-oriented approach to observability. We believe in creating a system that simplifies the complex, making it easier for teams to implement a robust observability framework that aligns with their operational goals and user expectations.
As we continue to explore the depths of effective observability practices, we invite you to join us on this journey. To get firsthand experience with our cutting-edge solutions and to contribute to the evolution of observability, we encourage you to sign up for our private beta. Join us in shaping the future of observability, where every insight is driven by purpose and every monitoring effort enhances the user experience.