This article is the first part of our blog series on effective observability strategies. Be sure to read the second part here.
I should start this article by stating the obvious: In today's
fast-paced digital world, the efficiency and reliability of our
technological infrastructures have become paramount. As we weave through
the complex web of software and systems that support our daily
operations, the role of monitoring these systems escalates from a mere
task to a critical necessity. However, there's a gaping chasm between
simply monitoring and monitoring with a purpose.
While it's widely recognized among engineers that monitoring
application functionality - with a focus on the user experience - is
paramount, achieving comprehensive coverage that mirrors the user
experience can be daunting. Often, observability efforts start strong,
focusing on crucial user-oriented functionalities. Yet, the journey to
deep observability demands more; it requires a thorough exploration
beyond these surface-level checks. This isn't just about technical
diligence but about embracing observability as a fundamental part of
understanding and improving the user journey.
The truth is, many engineering teams start with the best intentions,
aiming to monitor their applications comprehensively. However, they
might find themselves constrained by immediate priorities or limited by
the initial scope of their monitoring tools. This isn't a failure on
their part but a reflection of the challenges inherent in balancing
development speed with thorough, user-centric observability.
Deepening observability requires a shift towards system analysis, an
in-depth understanding of user flows, and a collaborative effort between
project management and engineering. In practice, this means not just
ticking off boxes for monitoring basic functionalities but engaging in
continuous dialogue about what the users are experiencing. Observability
is now a software engineering process. It involves engineers, product
managers, and observability specialists coming together to map out the
user journey in its entirety and identify the touchpoints where
observability can offer the most significant insights and impact.
In this series of articles, I aim to explore the "ideal" vision of how
an observability ecosystem should be implemented for a product. Given
the constant changes and the addition of new functionalities inherent in
the agile nature of modern product development, achieving this ideal
scenario might not be entirely feasible. However, by envisioning an
ideal strategy, we can compare it to our current state and make
necessary adaptations. I believe that the foundation of implementing
such an ecosystem begins with a clear understanding of its purpose and
objectives. Therefore, I will start with a checklist of what I believe
should be included in this ecosystem, which tools could facilitate these
goals, and the rationale behind each implementation area.
Two critical top-level areas that we want monitored are:
Application Health and User Experience Monitoring
Ensuring the application works for users involves checks that replicate
browser actions performed by users to verify results. This includes
verifying the ability of users to log in, ensuring typical user journeys
are functioning correctly, and specifically verifying the completion of
essential operations within the application.
- Objective: Ensure the application's main functionality works as expected for users.
- Who Provides Metrics: Product management team defines critical user journeys and functionalities.
- Monitoring Tools: Synthetic monitoring (browser checks), Real User Monitoring (RUM).
- What to Monitor: Availability, functionality, and performance of key user journeys and features.
- Coverage Checklist:
  - Verify user ability to log in and navigate critical paths.
  - Check load times for key pages and actions.
  - Monitor for JavaScript errors on critical user journey paths.
  - Verify critical functionalities, like adding a friend on a social network, are working as intended.
- Alerts Checklist:
  - Set alerts for login failures or significant increases in login time.
  - Alert on increases in page load times beyond a predefined threshold.
  - Monitor and alert for an increase in JavaScript errors on critical paths.
  - Set up alerts for failures or significant performance issues in critical functionalities, such as errors during or failure to complete specific actions like adding a friend.
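To make this first checklist concrete, here is a minimal Python sketch of how a synthetic check's outcome might be mapped to an alert severity. The names and the 3-second budget are illustrative assumptions; the actual browser automation (logging in, clicking through a journey) would be driven by your synthetic monitoring tool, which then feeds results into logic like this:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str           # which scripted user journey this was, e.g. "login"
    ok: bool            # did the journey complete successfully?
    duration_ms: float  # end-to-end duration of the journey

def classify(result: CheckResult, budget_ms: float) -> str:
    """Map a synthetic check result to an alert severity."""
    if not result.ok:
        return "ALERT"   # journey failed outright (e.g. login is broken)
    if result.duration_ms > budget_ms:
        return "WARN"    # journey works, but is slower than its budget
    return "OK"

# Hypothetical example: a login journey with a 3-second budget.
login = CheckResult(name="login", ok=True, duration_ms=4200.0)
print(classify(login, budget_ms=3000.0))  # slower than budget -> WARN
```

The point of the severity split is that a slow-but-working journey warrants a different response than a failed one, matching the separate latency and failure alerts in the checklist above.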
Business Impact Analysis
Information about product issues sometimes originates from the business
team, indicated by changes such as conversion drops or revenue
decreases. Recognizing these issues promptly allows for a more measured,
less stressful response, involving both the product and tech teams in
the investigation.
- Objective: Ensure that critical business processes are within the accepted range, even if crucial application monitoring shows no errors.
- Who Provides Metrics: Business analysts in collaboration with the product management team.
- Monitoring Tools: Business intelligence (BI) tools, APM integrated with business analytics, custom dashboards.
- What to Monitor: Changes in business metrics, patterns that indicate a technical issue is affecting user behavior or business outcomes.
- Coverage Checklist:
  - Regular checks on conversion rates and other key performance indicators (KPIs).
  - Alerts for sudden changes in revenue or other critical business metrics.
  - Analysis of traffic sources and user engagement metrics for anomalies.
  - Coordination between technical and business teams to correlate tech issues with business metrics changes.
- Alerts Checklist:
  - Set alerts for significant drops or changes in conversion rates.
  - Monitor and alert on unexpected shifts in revenue or other critical business metrics.
  - Alert on anomalies in traffic sources and user engagement metrics that deviate from historical patterns.
  - Implement cross-functional alerts to notify both technical and business teams of potential issues detected through monitoring.
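As an illustration of the business-metric side, even a simple baseline comparison can flag a conversion drop before anyone opens a dashboard. The sketch below assumes hypothetical daily conversion rates and a three-sigma rule; in practice you would tune the window and threshold to your own data and seasonality:

```python
from statistics import mean, stdev

def conversion_alert(history, current, sigma=3.0):
    """Alert when the current conversion rate falls more than
    `sigma` standard deviations below the historical mean."""
    baseline = mean(history)
    spread = stdev(history)
    return current < baseline - sigma * spread

# Hypothetical daily conversion rates (fraction of sessions that convert).
last_two_weeks = [0.031, 0.029, 0.030, 0.032, 0.028, 0.031, 0.030,
                  0.029, 0.033, 0.030, 0.031, 0.028, 0.030, 0.032]

print(conversion_alert(last_two_weeks, current=0.019))  # True: investigate
print(conversion_alert(last_two_weeks, current=0.030))  # False: within range
```

A rule this naive will misfire on weekly cycles or marketing campaigns, which is exactly why the checklist pairs automated alerts with coordination between business and technical teams.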
Transitioning into a deeper dive within our ecosystem, it's essential
to first discern whether challenges arise from the frontend or the
frontend-facing backend APIs.
Frontend or Backend Issue Localization
- Objective: Determine whether issues are occurring in the frontend or the frontend-facing backend API.
- Who Provides Metrics: Frontend and backend development teams.
- Monitoring Tools: Synthetic monitoring (browser checks), Real User Monitoring (RUM), and uptime checks for frontend-facing APIs; consider employing tools such as AlertSite, Runscope, Pingdom, or similar.
- Coverage Checklist:
  - Ensure all frontend-facing APIs have uptime and performance monitoring.
  - Set up browser error logging to capture JavaScript and resource loading errors.
  - Implement frontend performance monitoring to track page load times and user interactions.
- Alerts Checklist:
  - Alert on downtime or performance degradation of frontend-facing APIs.
  - Set up alerts for critical browser-level errors or a spike in error rates.
  - Monitor and alert on significant changes in frontend performance metrics, like increased page load times.
If the issue is located in the frontend, the deployment team can revert to a previous working version or disable the feature with a feature flag, or the frontend team can provide a fix.
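A rough localization heuristic can be sketched as comparing browser-side and API-side error rates against their baselines. The function, baselines, and 3x multiplier below are illustrative assumptions, not a prescription; real localization would lean on the RUM and API monitoring tools listed above:

```python
def localize(js_error_rate, api_error_rate,
             js_baseline=0.01, api_baseline=0.01):
    """Suggest which layer to investigate first, based on whether
    browser-side or API-side error rates exceed 3x their baselines."""
    frontend = js_error_rate > 3 * js_baseline
    backend = api_error_rate > 3 * api_baseline
    if frontend and backend:
        return "both"      # often a backend failure surfacing in the UI
    if backend:
        return "backend"
    if frontend:
        return "frontend"  # e.g. a bad JS bundle while the API is healthy
    return "unclear"

# A spike in JavaScript errors while the API stays healthy points frontend.
print(localize(js_error_rate=0.08, api_error_rate=0.005))  # frontend
```

When both layers light up, starting from the backend usually pays off, since frontend errors are frequently a symptom of failing API calls.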
If something is happening to a frontend-facing backend API, in an ideal world the next step is the final localization of the underlying issue. If the whole application is covered with proper error logging and tracing, those systems will enable identification of the issue's location and facilitate work on the fix. However, in many situations there is not enough information; we'll cover this in the next steps.
Backend Issue Tracing
- Objective: Identify the exact location and cause of backend errors, providing a trace to the issue's origin when the entire service stack is monitored.
- Who Provides Metrics: Backend development and operations teams.
- Monitoring Tools: Distributed tracing tools (such as Jaeger, Zipkin), log aggregation tools (such as the ELK stack, Splunk), and Application Performance Monitoring (APM) tools.
- What to Monitor:
  - Endpoint performance and error rates to identify failing backend services.
  - Detailed error logs and exception tracking to understand the nature of backend issues.
  - Distributed traces of requests to follow the path of failure through the service stack.
- Coverage Checklist:
  - Ensure all backend services and endpoints are included in distributed tracing.
  - Set up detailed logging for error and exception tracking in all backend services.
  - Monitor backend services' performance metrics for signs of degradation or failures.
- Alerts Checklist:
  - Alert on critical errors or exceptions in backend services that could indicate failures.
  - Set thresholds for performance degradation alerts to catch issues before they impact users.
  - Use anomaly detection on traces and logs to automatically alert on unusual patterns indicating hidden issues.
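For readers who want a feel for what tracing involves, here is a deliberately minimal, stdlib-only sketch of span recording under a shared trace ID. A real system would use OpenTelemetry, Jaeger, or Zipkin clients rather than anything like this; the sketch only shows the core idea that every operation in one request shares a trace ID and records its own timing and error status:

```python
import contextvars
import time
import uuid

current_trace = contextvars.ContextVar("trace_id", default=None)
spans = []  # in a real system, spans are exported to a tracing backend

class span:
    """Minimal tracing span: records name, trace id, duration, errors."""
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        if current_trace.get() is None:
            current_trace.set(uuid.uuid4().hex)  # root span starts the trace
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        spans.append({
            "trace_id": current_trace.get(),
            "name": self.name,
            "duration_ms": (time.perf_counter() - self.start) * 1000,
            "error": exc_type is not None,
        })
        return False  # never swallow exceptions

# One simulated request flowing through two services.
with span("api.checkout"):
    with span("payments.charge"):
        pass  # the actual work of the downstream service goes here

print([s["name"] for s in spans])  # inner span closes first
```

Because every span carries the same trace ID, a failure deep in the stack can be followed back to the originating request, which is precisely the capability this step depends on.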
However, in the majority of cases, reality falls short of this. Product development, especially for a profitable product in an agile environment, moves at a fast pace, and it's easy for development teams to overlook corner cases. There might be no tracing implemented, and a microservice, instead of raising a proper exception, may silently return an empty response with HTTP 200. In such cases, instead of relying solely on tracing, we need to be able to track the issue with service-specific monitoring.
Service Failure Detection
- Objective: Detect the specific place in the application where the issue originated.
- Who Provides Metrics: Backend development and operations teams.
- Monitoring Tools: Utilize tools that enable executing custom requests against backend microservice APIs for monitoring, such as Postman for manual testing and automated scripts, Prometheus for collecting metrics and alerting based on those metrics, and Grafana for creating dashboards that visualize the data.
- What to Monitor: Per-service error rates and response times, response validity (including "successful" responses that carry empty or malformed payloads), and health-check status.
- Coverage Checklist:
  - Implement detailed logging for all backend services, capturing both standard operations and error conditions.
  - Include health checks for all services, ensuring that each service can report its health status at any given time.
- Alerts Checklist:
  - Set up alerts for any increase in error rates across services, distinguishing between critical errors and warnings.
  - Alert on any service response times that exceed predefined thresholds, indicating potential performance issues.
Once we've identified the service in question, obtaining detailed
information about the potential cause of an outage or performance
degradation is crucial. This issue might stem from the degradation of
the underlying infrastructure of the service (which we'll cover in the
next step), or it's often related to interactions with other components
such as other microservices within the application (as addressed by
Service Failure Detection), the service's connections to databases,
caches, and other storages, and its connections to external services'
APIs in the outer world. At this stage, profiling the service to
pinpoint the exact failing functionality becomes essential.
Service Profiling and Dependency Analysis
- Objective: Acquire detailed profiling metrics of the service to identify performance degradation and log stuck or interrupted connections to other services.
- Who Provides Metrics: Development team.
- Monitoring Tools: Application Performance Management (APM) tools, network performance monitoring tools, and service mesh observability tools.
- What to Monitor: The service's execution time for various functions, error rates, and failure points in interactions with dependent services.
- Coverage Checklist:
  - Profile and monitor key service functionalities to detect performance bottlenecks.
  - Track service interaction times with databases, caches, and other internal storage systems to identify delays.
  - Monitor connectivity and response times to external APIs, highlighting timeout issues or abnormal latencies.
  - Utilize service mesh tools to visualize and monitor the flow of requests through the microservices architecture, identifying failing services.
- Alerts Checklist:
  - Set up alerts for significant deviations in service execution times from their baseline values.
  - Alert on error rates exceeding predefined thresholds in service functionalities or in interactions with dependencies.
  - Implement alerts for timeouts or significant latencies in connections to databases, caches, external APIs, and other microservices.
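One lightweight way to approximate such profiling, shown here as an illustrative Python sketch rather than a substitute for an APM tool, is to record per-function durations and compare the latest call against a baseline. The function name, baseline, and deviation factor are all assumptions for the example:

```python
import time
from collections import defaultdict

timings = defaultdict(list)  # function name -> observed durations (ms)

def profiled(fn):
    """Record each call's duration so deviations from baseline stand out."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(
                (time.perf_counter() - start) * 1000)
    return wrapper

def deviates(name, baseline_ms, factor=2.0):
    """Alert when the latest call ran `factor` times slower than baseline."""
    return bool(timings[name]) and timings[name][-1] > factor * baseline_ms

@profiled
def fetch_profile():
    time.sleep(0.05)  # stands in for a database or cache dependency call

fetch_profile()
print(deviates("fetch_profile", baseline_ms=10.0))  # ~50 ms vs 10 ms baseline
```

The same pattern extended to dependency calls (database, cache, external API wrappers) is what surfaces the stuck or slow connections this step is after.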
Now, and only now, we're delving into an area that, in many
observability ecosystems, is often the only one comprehensively covered:
server/cloud infrastructure, which serves as the foundation for the
entire product. While the majority of previous scenarios
(application-level issues, problems with services' interconnection,
frontend issues) may not show any outliers in infrastructure-level
metrics - often, in fact, showing reduced load - many aspects related to
application issues stem from infrastructure performance degradation.
Infrastructure Monitoring
- Objective: Identify and distinguish between issues originating from the application layer and those stemming from the underlying infrastructure.
- Who Provides Metrics: Infrastructure and operations teams.
- Monitoring Tools: Infrastructure monitoring tools such as Prometheus, Grafana, and Datadog, alongside the cloud provider's native monitoring tools.
- What to Monitor:
  - Container orchestration metrics (e.g., Kubernetes cluster health, pod statuses).
  - Server metrics, including CPU usage, memory utilization, and disk I/O operations.
  - Network metrics, such as bandwidth usage, latency, and packet loss.
  - Health and availability of cloud services and other critical infrastructure components.
- Coverage Checklist:
  - Implement continuous monitoring of all physical and virtual servers' vital metrics.
  - Monitor the health and performance of container orchestration systems to ensure optimal deployment and scaling of applications.
  - Keep a close watch on network throughput and errors to identify potential bottlenecks or failures affecting application connectivity.
  - Ensure cloud services and infrastructure components are monitored for availability and performance issues.
- Alerts Checklist:
  - Set up threshold-based alerts for critical server metrics (CPU, memory, disk I/O) to quickly identify when resources are overstressed.
  - Configure alerts for container orchestration health issues, including failed deployments or unhealthy pods.
  - Establish network performance alerts to notify teams of potential connectivity issues or degradation in network quality.
  - Implement alerts for cloud service downtimes or performance degradations that could impact application availability or performance.
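Threshold-based infrastructure alerts, as in the checklist above, ultimately boil down to comparing metrics against limits. This Python sketch uses illustrative metric names and threshold values that mirror what a Prometheus alerting rule would express declaratively:

```python
# Per-metric alert thresholds (illustrative values, in percent).
THRESHOLDS = {
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "disk_io_wait_percent": 20.0,
}

def infra_alerts(metrics: dict) -> list:
    """Return the names of server metrics that exceed their thresholds."""
    return [name for name, value in metrics.items()
            if value > THRESHOLDS.get(name, float("inf"))]

# A hypothetical snapshot from one server: only CPU is overstressed.
sample = {"cpu_percent": 92.5, "memory_percent": 71.0,
          "disk_io_wait_percent": 3.2}
print(infra_alerts(sample))  # ['cpu_percent']
```

In practice you would add duration conditions (sustained breach for N minutes) to avoid paging on momentary spikes, which is what the `for` clause in a Prometheus rule provides.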
These steps, in my view, construct an ideal observability ecosystem that I'd like to see. "Ideal" here acknowledges the myriad reasons why this vision might not be easily achieved, yet it serves as a guiding "north star" that a company may aspire to reach. Nonetheless, there's an additional crucial step. Despite all efforts, there are instances when application issues slip through unnoticed. In such moments, our users reach out to report performance issues, failures, and other problems. While the support team will manage these interactions, making engineers aware of a spike in such reports from the outset allows them to initiate investigations before user dissatisfaction proliferates.
User Feedback Integration
- Objective: Leverage user feedback to identify and prioritize issues not detected by automated monitoring systems.
- Who Provides Metrics: Customer support and product management teams.
- Monitoring Tools: Customer feedback platforms, support ticket systems integrated with monitoring tools, sentiment analysis tools for social media and forums.
- What to Monitor:
  - Volume and nature of user-reported issues.
  - Sentiment analysis results from user feedback across various channels.
  - Correlation between user feedback and metrics captured by monitoring systems.
- Coverage Checklist:
  - Implement a system for categorizing and quantifying user-reported issues for trend analysis.
  - Integrate user feedback from multiple sources, including support tickets, social media, and direct user feedback, into a centralized dashboard for easy monitoring.
  - Use sentiment analysis tools to gauge user sentiment across different platforms, identifying potential issues from user discussions.
  - Establish processes for correlating spikes in user-reported issues with data from monitoring tools to identify potential root causes.
- Alerts Checklist:
  - Set up real-time alerts for abnormal increases in user-reported issues, enabling swift response to emerging problems.
  - Configure alerts for significant shifts in sentiment analysis metrics, indicating changing user perceptions that might not yet be reflected in support tickets.
  - Implement cross-functional alerts to ensure that spikes in user feedback are communicated to both technical and customer support teams, fostering a collaborative approach to problem resolution.
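Spike detection on user-reported issues can start as simply as comparing the latest hour against a trailing average. The window, multiplier, and absolute floor below are illustrative assumptions; the floor keeps a jump from one ticket to three from paging anyone:

```python
def ticket_spike(hourly_counts, window=24, factor=3.0, minimum=5):
    """Flag a spike when the latest hour's ticket count exceeds `factor`
    times the trailing average and an absolute floor (`minimum`)."""
    if len(hourly_counts) < window + 1:
        return False  # not enough history to form a baseline
    baseline = sum(hourly_counts[-window - 1:-1]) / window
    latest = hourly_counts[-1]
    return latest >= minimum and latest > factor * baseline

# Hypothetical support-ticket counts per hour over the last day, then a jump.
history = [2, 1, 3, 2, 2, 1, 0, 2, 3, 2, 1, 2,
           2, 3, 1, 2, 2, 1, 3, 2, 2, 1, 2, 2, 14]
print(ticket_spike(history))  # True: 14 tickets vs a ~2/hour baseline
```

Wiring an alert like this to notify both engineering and support channels gives the cross-functional early warning the checklist calls for.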
In this introductory article, I've outlined the foundational philosophy
behind a purpose-oriented observability ecosystem, emphasizing the
critical role it plays in aligning monitoring efforts with user
experience and application health.
Further, I will delve into the technical specifics of implementing each
discussed step, providing practical examples and case studies. These
insights, I hope, will serve as a valuable reference for those looking
to refine their observability strategies, ensuring they are equipped
with the knowledge to build a system that not only monitors but truly
understands and improves the user journey.
At ApexData, our mission is to empower companies to embrace this
purpose-oriented approach to observability. We believe in creating a
system that simplifies the complex, making it easier for teams to
implement a robust observability framework that aligns with their
operational goals and user expectations.
As we continue to explore the depths of effective observability practices, we invite you to join us on this journey. To get firsthand experience with our cutting-edge solutions and to contribute to the evolution of observability, we encourage you to sign up for our private beta. Join us in shaping the future of observability, where every insight is driven by purpose and every monitoring effort enhances the user experience.