
Claude Code Workshop & Best Practices Speaker: Evgeny Potapov, ApexData co-founder & CEO

This is Part 1 of our observability series. Read Part 2 here.
Most teams start monitoring with good intentions but stop at surface-level checks — uptime pings, basic error rates, maybe a few dashboards. Real observability goes deeper: it requires understanding user flows and collaboration between product, engineering, and operations.
This article outlines what I consider an ideal observability ecosystem. You probably can't build all of it at once, but it serves as a target to measure your gaps against.
Two top-level things to monitor:
Application Health and User Experience
Checks that replicate real browser actions to verify the application works for users — login flows, critical user journeys, and completion of essential operations.
Objective: Ensure the application's main functionality works as expected for users.
Who Provides Metrics: Product management defines critical user journeys.
Monitoring Tools: Synthetic monitoring (browser checks), Real User Monitoring (RUM).
What to Monitor: Availability, functionality, and performance of key user journeys.
Coverage Checklist:
Verify user login and critical path navigation.
Check load times for key pages and actions.
Monitor JavaScript errors on critical paths.
Verify critical functionalities (e.g., adding a friend on a social network) work as intended.
Alerts Checklist:
Login failures or significant increases in login time.
Page load times beyond a predefined threshold.
Increase in JavaScript errors on critical paths.
Failures or performance degradation in critical functionalities.
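To make the alert rules above concrete, here is a sketch of how a batch of synthetic login-check runs might be evaluated. The `CheckResult` shape and the threshold values are hypothetical; in practice the raw results come from your synthetic monitoring tool, and the thresholds come from product-defined SLOs.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class CheckResult:
    ok: bool            # did the scripted login journey complete?
    duration_ms: float  # end-to-end time of the journey
    js_errors: int      # JavaScript errors captured during the run

def evaluate(results, max_duration_ms=3000.0, max_failure_rate=0.05):
    """Return alert messages for a non-empty batch of synthetic-check runs."""
    alerts = []
    failure_rate = sum(1 for r in results if not r.ok) / len(results)
    if failure_rate > max_failure_rate:
        alerts.append(f"login failure rate {failure_rate:.0%} exceeds {max_failure_rate:.0%}")
    typical = median(r.duration_ms for r in results)
    if typical > max_duration_ms:
        alerts.append(f"median login time {typical:.0f}ms exceeds {max_duration_ms:.0f}ms")
    if any(r.js_errors for r in results):
        alerts.append("JavaScript errors observed on a critical path")
    return alerts
```

Using the median rather than the mean keeps a single slow outlier run from paging anyone; a stricter team might alert on p95 instead.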
Business Impact Analysis
Product issues sometimes surface through the business team first — conversion drops, revenue decreases. Catching these early allows a calmer, more coordinated response between product and engineering.
Objective: Ensure critical business processes stay within accepted ranges, even when application monitoring shows no errors.
Who Provides Metrics: Business analysts with product management.
Monitoring Tools: BI tools, APM integrated with business analytics, custom dashboards.
What to Monitor: Changes in business metrics and patterns indicating a technical issue.
Coverage Checklist:
Regular checks on conversion rates and KPIs.
Alerts for sudden changes in revenue or critical business metrics.
Traffic source and engagement anomaly analysis.
Coordination between tech and business teams to correlate issues with metric changes.
Alerts Checklist:
Significant drops in conversion rates.
Unexpected shifts in revenue or critical business metrics.
Traffic and engagement anomalies deviating from historical patterns.
Cross-functional alerts to both technical and business teams.
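A minimal way to operationalize the "sudden changes in conversion" alert is a z-score against a recent baseline. This is only a sketch: the baseline window (for example, the same hour across the last seven days) and the z-threshold are assumptions you would tune with your business analysts.

```python
from statistics import mean, stdev

def conversion_alert(history, current, z_threshold=3.0):
    """Flag the current conversion rate if it deviates from history.

    history: recent comparable conversion rates (at least two points).
    Returns a reason string, or None if the value looks normal."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return None if current == mu else f"rate {current} differs from constant baseline {mu}"
    z = (current - mu) / sigma
    if abs(z) >= z_threshold:
        direction = "drop" if z < 0 else "spike"
        return f"conversion {direction}: {current:.3f} vs baseline {mu:.3f} (z={z:.1f})"
    return None
```

Note that the spike direction also matters: an unexpected jump in conversion can signal a pricing bug just as surely as a drop signals a broken checkout.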
Once top-level monitoring is in place, determine whether issues come from the frontend or the backend.
Frontend or Backend Issue Localization
Objective: Determine whether issues originate in the frontend or in the frontend-facing backend API.
Who Provides Metrics: Frontend and backend development teams.
Monitoring Tools: Synthetic monitoring, RUM, frontend-facing API monitors (AlertSite, RunScope, Pingdom, etc.).
Coverage Checklist:
All frontend-facing APIs have uptime and performance monitoring.
Browser error logging captures JavaScript and resource loading errors.
Frontend performance monitoring tracks page load times and interactions.
Alerts Checklist:
Downtime or performance degradation of frontend-facing APIs.
Critical browser-level errors or spikes in error rates.
Significant changes in frontend performance metrics.
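A bare-bones uptime and latency probe for a frontend-facing API can be written with the standard library alone; hosted tools like Pingdom or AlertSite do the same thing at scale from multiple regions. The timeout and latency limit below are placeholder values.

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=5.0):
    """One uptime/latency probe against an API endpoint.

    Returns (status_code_or_None, latency_seconds); None means unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except urllib.error.HTTPError as e:
        return e.code, time.monotonic() - start
    except (urllib.error.URLError, OSError):
        return None, time.monotonic() - start

def is_healthy(status, latency, max_latency=1.0):
    """Downtime or degradation check for a single probe result."""
    return status is not None and 200 <= status < 400 and latency <= max_latency
```

Splitting the probe from the health decision keeps the thresholds testable and easy to tune per endpoint.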
If the issue is in the frontend: revert to a previous version, disable the feature with a flag, or have the frontend team provide a fix.
If the issue is in a frontend-facing backend API and your application has proper error logging and tracing, those systems will locate it. In many cases though, there isn't enough information — that's what the next steps address.
Backend Issue Tracing
Objective: Identify the exact location and cause of backend errors when the service stack is fully instrumented.
Who Provides Metrics: Backend development and operations teams.
Monitoring Tools: Distributed tracing (Jaeger, Zipkin), log aggregation (ELK stack, Splunk), APM tools.
What to Monitor:
Endpoint performance and error rates.
Detailed error logs and exception tracking.
Distributed traces following the failure path through the service stack.
Coverage Checklist:
All backend services included in distributed tracing.
Detailed error and exception logging in all services.
Performance metrics monitored for degradation.
Alerts Checklist:
Critical errors or exceptions in backend services.
Performance degradation thresholds to catch issues before user impact.
Anomaly detection on traces and logs for unusual patterns.
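To make the trace model concrete, here is a toy version of the ID propagation that systems like Jaeger and Zipkin rely on. Real services would use an OpenTelemetry SDK rather than hand-rolled spans; this sketch only shows how a shared trace_id ties together spans emitted by different services.

```python
import time
import uuid

class Span:
    """A minimal span: enough to reconstruct a failure path later.

    Illustrative only; production code should use a tracing SDK."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole request
        self.span_id = uuid.uuid4().hex[:16]          # unique to this operation
        self.parent_id = parent_id
        self.start = time.monotonic()

    def child(self, name):
        """A nested operation, e.g. a downstream call from this one."""
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration = time.monotonic() - self.start

# The trace_id travels across service boundaries (typically in an HTTP
# header), so the collector can stitch spans into one request tree.
root = Span("GET /checkout")
db = root.child("postgres.query")
db.finish()
root.finish()
```

When the same trace_id appears in every service's logs, "where did this request fail" becomes a query instead of an investigation.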
In practice, full tracing is rare. Product development moves fast, teams overlook corner cases, and a microservice might return HTTP 200 with no useful information instead of a proper exception. When tracing falls short, you need service-specific monitoring.
Service Failure Detection
Objective: Detect the specific place in the application where the issue originated.
Who Provides Metrics: Backend development and operations teams.
Monitoring Tools: Custom request execution against backend APIs (Postman, automated scripts), Prometheus, Grafana.
What to Monitor:
Error rates and types by service.
Service response times and outliers.
Coverage Checklist:
Detailed logging for all backend services, covering normal operations and error conditions.
Health checks for all services.
Alerts Checklist:
Increases in error rates, distinguishing critical errors from warnings.
Response times exceeding predefined thresholds.
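The two alert rules above can be expressed as a small evaluation function over a service's recent samples. Treating only 5xx responses as critical and using a nearest-rank p95 are assumptions; adjust both to your service's error semantics.

```python
def p95(samples):
    """95th percentile by nearest rank: a common latency SLO statistic."""
    ordered = sorted(samples)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]

def service_alerts(latencies_ms, statuses, p95_limit_ms=500, error_rate_limit=0.01):
    """Evaluate a non-empty window of per-request latencies and status codes."""
    alerts = []
    errors = sum(1 for s in statuses if s >= 500)  # 5xx = critical; 4xx usually isn't
    if errors / len(statuses) > error_rate_limit:
        alerts.append("error rate above limit")
    if p95(latencies_ms) > p95_limit_ms:
        alerts.append("p95 latency above limit")
    return alerts
```

Keeping warnings (4xx) out of the error rate is exactly the "distinguishing critical errors from warnings" point in the checklist above.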
Once you've identified the failing service, you need to understand why. The issue might stem from infrastructure degradation (covered next), interactions with other microservices, database/cache connections, or external API dependencies.
Service Profiling and Dependency Analysis
Objective: Get detailed profiling metrics to identify performance degradation and log failing connections to dependencies.
Who Provides Metrics: Development team.
Monitoring Tools: APM tools, network performance monitoring, service mesh observability.
What to Monitor: Execution time for key functions, error rates, and failure points in dependency interactions.
Coverage Checklist:
Profile key service functionalities to detect bottlenecks.
Track interaction times with databases, caches, and internal storage.
Monitor connectivity and response times to external APIs.
Use service mesh tools to visualize request flow through microservices.
Alerts Checklist:
Significant deviations in execution times from baseline.
Error rates exceeding thresholds in service functionalities or dependency interactions.
Timeouts or significant latencies in connections to databases, caches, external APIs, and other microservices.
Now we reach the layer that is most commonly the only one covered well: server and cloud infrastructure. Most application-level and service interconnection issues won't show outliers in infrastructure metrics (if anything, they show up as reduced load), but many application issues do stem from infrastructure degradation.
Infrastructure Monitoring
Objective: Distinguish between application-layer issues and infrastructure issues.
Who Provides Metrics: Infrastructure and operations teams.
Monitoring Tools: Prometheus, Grafana, Datadog, cloud provider native monitoring.
What to Monitor:
Container orchestration metrics (Kubernetes cluster health, pod statuses).
Server metrics: CPU, memory, disk I/O.
Network metrics: bandwidth, latency, packet loss.
Cloud service availability.
Coverage Checklist:
Continuous monitoring of all server metrics.
Container orchestration health and performance.
Network throughput and error monitoring.
Cloud service availability and performance.
Alerts Checklist:
Threshold-based alerts for CPU, memory, disk I/O.
Container orchestration issues (failed deployments, unhealthy pods).
Network performance degradation.
Cloud service downtimes or performance issues.
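Since Prometheus and Grafana are among the tools named above, threshold alerts of this kind are usually written as Prometheus alerting rules. The sketch below assumes node_exporter and kube-state-metrics are installed and exporting their standard metrics; the thresholds and `for:` durations are placeholders to tune.

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPU
        # CPU usage = 100 minus the idle fraction, averaged per instance
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"
      - alert: PodNotReady
        expr: kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} not ready"
```

The `for:` clause is what keeps these rules from paging on transient spikes; a threshold alone is rarely enough.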
Despite all these layers, some issues will slip through. When they do, users report them. Having engineers aware of support ticket spikes early lets investigation begin before the problem spreads.
User Feedback Integration
Objective: Use user feedback to catch issues missed by automated monitoring.
Who Provides Metrics: Customer support and product management.
Monitoring Tools: Customer feedback platforms, support ticket systems, sentiment analysis.
What to Monitor:
Volume and nature of user-reported issues.
Sentiment across feedback channels.
Correlation between user feedback and monitoring metrics.
Coverage Checklist:
Categorize and quantify user-reported issues for trend analysis.
Centralize feedback from support tickets, social media, and direct channels.
Use sentiment analysis to gauge user perception.
Correlate spikes in user reports with monitoring data.
Alerts Checklist:
Abnormal increases in user-reported issues.
Significant shifts in sentiment metrics.
Cross-functional alerts to both technical and support teams.
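Detecting an "abnormal increase in user-reported issues" can start as simply as comparing current ticket volume to a trailing average. The multiplier and the absolute floor below are assumptions; the floor keeps tiny volumes (two tickets versus zero) from paging anyone.

```python
from statistics import mean

def ticket_spike(hourly_counts, current, factor=3.0, floor=5):
    """Flag an abnormal jump in user-reported issues.

    hourly_counts: recent per-hour ticket counts forming the baseline.
    floor: minimum absolute volume before a spike is considered real."""
    baseline = mean(hourly_counts)
    return current >= floor and current > factor * baseline
```

Once a spike fires, the correlation step from the checklist above kicks in: line the spike's timestamp up against deploys and monitoring anomalies from the earlier layers.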
These steps form what I'd consider an ideal observability ecosystem — a north star to aim for, even if you can't build it all at once.
In the next article, I'll cover why observability needs to be part of your development process and what that actually costs in time.

