An observability strategy is usually born when something broke that nobody had a graph for, and grows the same way: one alert per failure mode the team has already seen. The drift over a couple of years is predictable — the dashboards cover the parts of the system that have already failed, and the parts that have not are still invisible. The question this post is about is what an observability strategy would look like if you were designing it before any of those incidents, working down from the people the system is supposed to serve rather than up from the boxes it runs on.
This is Part 1 of three. Part 1 is the structure of the layers and what each one is supposed to answer. Part 2 is what that strategy costs in development time and why the cost is worth describing up front. Part 3 is the operational side — the practices that reduce the rate of escalations from on-call back into engineering once the strategy is in place.
You will not build all of this at once, and the parts you do build will probably arrive in the wrong order. The point is to have a target to measure your gaps against — to know which layers you are not watching, not to feel bad about missing any of them.
Two questions above anything else
The first thing to know about a production system is whether it is working for the people using it. Two signals answer that, and they answer different questions.
Synthetic monitoring and real user monitoring (RUM) tell you whether the critical user journeys still work. A synthetic check fires from a controlled environment on a schedule — log in, navigate to checkout, add an item, complete the flow — and records whether each step succeeds and how long it took. RUM, by contrast, instruments the application as real users use it: page load timings, JavaScript errors, the latency of XHR and fetch calls from inside the browser. The two are complementary. Synthetic checks tell you the application is operable; RUM tells you what your actual users are experiencing, which is the only signal that catches "fast for us, slow for everybody in São Paulo." Define the journeys with product, not engineering — engineering knows the routes, product knows which routes matter.
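As a rough illustration of the synthetic side, a journey can be modeled as an ordered list of named steps, each timed and marked pass or fail. This is a minimal sketch rather than a real monitoring client: the step functions (`log_in`, `open_checkout`, `add_item`) are hypothetical stand-ins for whatever drives a browser or calls the site, and the simulated timeout is there only to show a failed step.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_journey(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run steps in order and stop at the first failure, as a user would."""
    results: List[StepResult] = []
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
            results.append(StepResult(name, True, time.monotonic() - started))
        except Exception as exc:  # any exception counts as a failed step
            elapsed = time.monotonic() - started
            results.append(StepResult(name, False, elapsed, repr(exc)))
            break
    return results


# Hypothetical steps; real ones would drive a browser or call the site.
def log_in() -> None:
    pass


def open_checkout() -> None:
    pass


def add_item() -> None:
    raise TimeoutError("cart service did not respond")  # simulated failure


checkout_journey = [
    ("log_in", log_in),
    ("open_checkout", open_checkout),
    ("add_item", add_item),
    ("complete_order", lambda: None),
]

results = run_journey(checkout_journey)
for r in results:
    print(r.name, "ok" if r.ok else "FAILED", f"{r.seconds:.3f}s", r.error)
```

Stopping at the first failed step is a design choice: later steps in a journey usually depend on earlier ones, so their results would be noise.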
Business signals are the second answer, and they catch a class of incident the application layer cannot. Sometimes the application reports no errors at all and the first symptom is a drop in conversion or a flat revenue line — a checkout that completes but charges the wrong amount, a recommendation engine that started returning empty results, a referral path that quietly stopped tracking. Business dashboards owned by analysts and product, with alerts on the metrics the company is actually run on, surface those before anyone in engineering would have looked. The cross-functional handoff is the part to get right: a sudden conversion drop should page both the analytics team and on-call, because by the time it has been triaged in one of the two teams alone, the other one has lost an hour.
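One way to picture the alert on a business metric is a comparison of the current conversion rate against a recent baseline, with the page going to both teams at once. The sketch below is an assumption-heavy illustration: the 30% threshold, the baseline window, and the recipient names are all hypothetical, and a real implementation would account for seasonality and traffic volume.

```python
from statistics import mean
from typing import List, Sequence


def conversion_rate(orders: int, sessions: int) -> float:
    return orders / sessions if sessions else 0.0


def conversion_dropped(
    current: float,
    baseline: Sequence[float],
    max_relative_drop: float = 0.3,  # hypothetical threshold
) -> bool:
    """True when current is more than max_relative_drop below the baseline mean."""
    if not baseline:
        return False
    expected = mean(baseline)
    if expected == 0:
        return False
    return (expected - current) / expected > max_relative_drop


def page_targets(dropped: bool) -> List[str]:
    # Both teams are paged together so neither triages alone for an hour.
    return ["analytics", "on-call"] if dropped else []


# Hypothetical numbers: same hour on the four previous days vs. now.
baseline_rates = [0.031, 0.029, 0.030, 0.030]
current_rate = conversion_rate(orders=90, sessions=5000)

dropped = conversion_dropped(current_rate, baseline_rates)
print(page_targets(dropped))
```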
Both of these sit above the application code. That position is deliberate. By the time CPU is high on a host, the question of who is affected and how badly has already been answered somewhere up here, by people whose dashboard does not require reading the service map.
Frontend or backend?
Once a user-facing or business signal fires, the next question is whether to look at code that runs in the browser or code that runs behind an API. The triage is usually quick but worth making explicit, because the two have entirely different runbooks.
RUM is already telling you a lot. A spike in JavaScript exceptions, a page that started bundling a broken dependency, a third-party script that times out — these read directly off browser-side telemetry, and the fix is on the frontend team's bench. A frontend-facing API monitor running synthetic checks against the API the browser actually talks to confirms the other side: if those checks are red, the issue is server-side and you can stop reading the frontend dashboards.
The frontend playbook for a real failure is usually one of three moves: revert to the previous deploy, hide the broken feature behind a flag, or ship a fix. The choice is mostly about how confident you are about the cause and how long the fix will take. The backend playbook is longer, which is why most of the rest of this post is about it.
Tracing the request through the backend
When the issue is server-side, the next question is where in the request path the time went or the error happened. Distributed tracing answers that directly when it is in place: a request comes in, a trace is assembled from spans contributed by each service that handled it, and the span tree shows which service held the request for how long and which one returned the error.
Tools that do this — Jaeger, Zipkin, OpenTelemetry's collector, vendor APMs — all share a model. Each service emits spans with timing and metadata; a collector stitches them together using the trace context propagated through request headers. With tracing in place and every backend service participating, the answer to "what is wrong" is a few clicks: open the trace for a failing request, find the span that errored, read the attributes attached to it.
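The propagation step is small enough to sketch by hand. The example below assumes the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and lowercase header keys; the function names are hypothetical, and in practice a tracing SDK would do this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Assumed format: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is unusable."""
    match = _TRACEPARENT.fullmatch(value.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are not valid
    return trace_id, parent_id, flags


def outbound_headers(incoming: Dict[str, str]) -> Tuple[Dict[str, str], str]:
    """Continue the incoming trace, or start a new one if none was propagated.

    Returns the headers to send downstream and this service's own span id.
    """
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed:
        trace_id, _parent_id, flags = parsed
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}, span_id


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
headers, span_id = outbound_headers(incoming)
print(headers["traceparent"])
```

A service that forgets to call something like `outbound_headers` on its downstream requests is the point where a trace breaks: the next service starts a fresh trace ID and the span tree splits in two.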
The honest caveat is that full tracing is rarer than the diagrams in the vendors' marketing imply. Product teams move fast, corner cases get overlooked, and the first service to skip propagating trace context is usually the one that turns out to matter. A microservice returning HTTP 200 with no body where it should have raised an exception is invisible to a tracing system that only marks spans as errored when the framework does. Detailed exception logging in each service, with stack traces and the request context that triggered the failure, is what fills the gap — and is also what most teams notice they are missing when an incident makes them try to follow the trace.
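A sketch of what filling that gap can look like: a wrapper that treats an empty response as a failure and logs every exception with its stack trace and the request context. The request shape (a dict with `request_id`, `route`, and `trace_id`) and the names `with_failure_logging` and `EmptyResponseError` are assumptions made for illustration, not any framework's API.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("orders")  # hypothetical service name

Request = Dict[str, Any]


class EmptyResponseError(Exception):
    """Raised when a handler returns nothing where a body was expected."""


def with_failure_logging(handler: Callable[[Request], Any]) -> Callable[[Request], Any]:
    def wrapped(request: Request) -> Any:
        context = {
            "request_id": request.get("request_id"),
            "route": request.get("route"),
            "trace_id": request.get("trace_id"),
        }
        try:
            body = handler(request)
            # A "successful" empty body is the failure tracing will not flag.
            is_empty_payload = isinstance(body, (str, bytes)) and len(body) == 0
            if body is None or is_empty_payload:
                raise EmptyResponseError(f"{handler.__name__} returned no body")
            return body
        except Exception:
            # logger.exception records the stack trace alongside the message.
            logger.exception("request failed: %s", context)
            raise

    return wrapped


@with_failure_logging
def get_order(request: Request) -> Any:
    return None  # simulates the silent empty response


try:
    get_order({"request_id": "r-1", "route": "/orders/42", "trace_id": "abc"})
except EmptyResponseError as exc:
    print("surfaced:", exc)
```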
The realistic state of tracing in most production systems is partial. The realistic dependency on it, then, is also partial: tracing is the fastest path when the path is complete, and the layers below are where you go when it is not.
When tracing is incomplete: service-level monitoring
If a trace does not lead you to the failing service, you look at the services themselves. Per-service signals are coarser than per-trace ones, but they are easier to keep complete because they do not require every service in the path to cooperate.
The signals worth having for every service: a request rate, an error rate broken down by error class, a latency distribution (a histogram or quantile, not an average), and a health check. The RED method (rate, errors, duration) is shorthand for the first three. Errors deserve some structure: a 500 from a database timeout is not the same as a 500 from a downstream service returning a malformed response, and an alert that fires on "any 500s" will eventually be muted. Health checks tell you whether a service is up, which is different from whether it is healthy; a service can pass its check and still be quietly failing every third request, which is why the health check is necessary but never sufficient.
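To make the rate, error, and duration signals concrete, here is a minimal in-memory sketch using only the standard library. The class name, the error class label `db_timeout`, and the sample numbers are hypothetical; a real service would export these through a metrics library rather than keep raw samples in a list.

```python
from collections import Counter
from statistics import quantiles
from typing import Dict, List


class ServiceStats:
    """Rate, errors by class, and latency quantiles for one service."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors: Counter = Counter()
        self.durations: List[float] = []

    def record(self, seconds: float, error_class: str = "") -> None:
        self.requests += 1
        self.durations.append(seconds)
        if error_class:
            self.errors[error_class] += 1

    def summary(self, window_seconds: float) -> Dict[str, object]:
        # quantiles(n=100) returns 99 cut points; index 49 is p50, 98 is p99.
        cuts = quantiles(self.durations, n=100) if len(self.durations) >= 2 else []
        error_total = sum(self.errors.values())
        return {
            "rate_per_s": self.requests / window_seconds,
            "error_ratio": error_total / self.requests if self.requests else 0.0,
            "errors_by_class": dict(self.errors),
            "p50_s": cuts[49] if cuts else None,
            "p99_s": cuts[98] if cuts else None,
        }


stats = ServiceStats()
for _ in range(98):
    stats.record(0.05)                  # typical fast request
for _ in range(2):
    stats.record(2.0, "db_timeout")     # slow requests that also failed

summary = stats.summary(window_seconds=60)
print(summary)
```

With these samples the average latency stays under a tenth of a second while the p99 sits at the slow requests, which is the reason to prefer a quantile over a mean.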
Once you have identified which service is failing, you need to know why. There are usually two directions to look. Inside the service: profile the slow paths, find the function that has gotten more expensive than it used to be, check whether a recent deploy changed the shape of a hot loop. Outside the service: the database, the cache, the queue, the downstream service, the external API. Profiling catches the inside ones; dependency monitoring catches the outside ones.
Dependency monitoring is the one that tends to be underbuilt. A service that talks to four downstream APIs, two databases, and a cache has seven relationships to track, each with its own connection pool, its own timeout, and its own failure mode. A service mesh (Istio, Linkerd, Consul Connect) collects most of these metrics for free and shows them as edges in a service graph. Where there is no mesh, the same signals can come from the services themselves, recorded as per-dependency RED metrics on every outbound call. Either way, the question to be able to answer is: when this service is slow, is it slow on its own or because something it depends on is slow.
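Without a mesh, per-dependency metrics can be as simple as a wrapper around every outbound call. The sketch below is a hypothetical in-process version: the dependency names are invented, `time.sleep` stands in for real network calls, and module-level dictionaries stand in for a metrics client.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List

call_durations: Dict[str, List[float]] = defaultdict(list)
call_failures: Dict[str, int] = defaultdict(int)


@contextmanager
def outbound(dependency: str) -> Iterator[None]:
    """Time one outbound call and count it as a failure if it raises."""
    started = time.monotonic()
    try:
        yield
    except Exception:
        call_failures[dependency] += 1
        raise
    finally:
        call_durations[dependency].append(time.monotonic() - started)


def slowest_dependency() -> str:
    """Dependency with the most total time spent waiting on it."""
    return max(call_durations, key=lambda name: sum(call_durations[name]))


# Hypothetical dependencies; the sleeps simulate network round trips.
with outbound("payments-api"):
    time.sleep(0.05)

with outbound("sessions-cache"):
    time.sleep(0.001)

try:
    with outbound("orders-db"):
        raise ConnectionError("connection pool exhausted")  # simulated failure
except ConnectionError:
    pass

print(slowest_dependency(), dict(call_failures))
```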
Infrastructure, last
Infrastructure monitoring — host CPU, memory, disk, network, container orchestration health — is usually the layer teams build first, because it is the easiest to instrument off the shelf and the most universally supported by tools like Prometheus and the cloud providers' native dashboards. It also tends to be the only layer some teams ever build, which is the reason this post puts it near the bottom rather than the top.
The reason to look at infrastructure last is that most application-level and service-interaction issues do not move the infrastructure metrics in the direction people expect them to. A service hanging on a downstream timeout is using less CPU than usual, not more, because it is sitting on a blocked socket. A database query that has lost its index burns its extra CPU on the database host, not the application server; the application server, now mostly waiting, can show better metrics than before. An out-of-memory kill is a real infrastructure event, but the cause is almost always in the application's allocation pattern, not the host's. The shape of incident where "the CPU graph is bent" is the first useful signal is rare enough that starting an investigation there will mislead the team most of the time.
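The blocked-socket case is easy to reproduce in miniature. In this sketch, `time.sleep` is an assumed stand-in for a socket read waiting on a slow dependency; comparing wall-clock time with process CPU time shows why the CPU graph stays flat while requests are stuck.

```python
import time
from typing import Callable, Tuple


def measure(work: Callable[[], None]) -> Tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) spent running work()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def blocked_on_dependency() -> None:
    time.sleep(0.2)  # stands in for a read on a socket that is not answering


def busy_loop() -> None:
    total = 0
    for i in range(2_000_000):
        total += i * i


wall_blocked, cpu_blocked = measure(blocked_on_dependency)
wall_busy, cpu_busy = measure(busy_loop)

print(f"blocked: wall={wall_blocked:.3f}s cpu={cpu_blocked:.3f}s")
print(f"busy:    wall={wall_busy:.3f}s cpu={cpu_busy:.3f}s")
```

The blocked call takes the longer wall-clock time of the two on most machines and uses almost no CPU, which is the pattern a latency histogram catches and a host CPU graph does not.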
That said, the layer matters. Container orchestration failures — failed deployments, pods stuck in CrashLoopBackOff, scheduling pressure on a node — are real, visible on infrastructure dashboards, and rarely visible anywhere else. Network issues are real and infrastructure-shaped: packet loss, bandwidth saturation, a DNS resolver that has gone slow. Disk-full conditions, NTP drift, kernel-level resource exhaustion — these all live here. The right way to use this layer is as the place to confirm or rule out, not the place to start: once the layers above have pointed at a service, infrastructure tells you whether the service is bound by something the host is doing to it.
The safety net: user feedback
The layers above cover most of the failure modes a careful strategy will catch, and miss some. The ones they miss have a shape: they are slow, they are localized, or they fail in a way the synthetic check did not exercise. By the time those are noticed at all, the people noticing them are users, and the channel they are on is support.
The integration that matters is not a sentiment-analysis dashboard. It is a way for an engineer on call to see, before the next standup, that the rate of tickets containing the word "checkout" has tripled in the last hour. Three pieces make this work: a categorization of tickets that is consistent enough to be queried; a metric on top of that categorization that an alerting system can read; and an alert that fires to the same channel as the rest of on-call, not to a separate support queue that nobody is looking at outside business hours.
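The metric in the middle of those three pieces can be very plain: tickets per category in the last hour against an hourly baseline. The sketch below assumes tickets arrive as `(created_at, category)` pairs; the tripling factor comes from the example above, while the 24-hour baseline and the minimum count are hypothetical tuning values.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Ticket = Tuple[datetime, str]  # (created_at, category)


def ticket_spike(
    tickets: Iterable[Ticket],
    category: str,
    now: datetime,
    baseline_hours: int = 24,  # hypothetical baseline window
    factor: float = 3.0,       # "tripled"
    min_count: int = 5,        # ignore spikes made of one or two tickets
) -> bool:
    """True when the last hour's ticket count is at least factor x the baseline."""
    hour = timedelta(hours=1)
    recent_start = now - hour
    baseline_start = recent_start - hour * baseline_hours
    matching: List[datetime] = [t for t, c in tickets if c == category]
    recent = sum(1 for t in matching if recent_start <= t <= now)
    earlier = sum(1 for t in matching if baseline_start <= t < recent_start)
    baseline_per_hour = earlier / baseline_hours
    return recent >= min_count and recent >= factor * baseline_per_hour


# Hypothetical data: two checkout tickets an hour for a day, then seven in an hour.
now = datetime(2024, 1, 2, 12, 0)
tickets: List[Ticket] = [
    (now - timedelta(hours=h, minutes=m), "checkout")
    for h in range(1, 25)
    for m in (10, 40)
]
tickets += [(now - timedelta(minutes=m), "checkout") for m in range(5, 40, 5)]

if ticket_spike(tickets, "checkout", now):
    print("page on-call: checkout tickets spiking")
```

The output of a check like this belongs in the same alerting channel as everything else on-call watches, which is the third piece of the integration.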
The point is not that users are a good telemetry source — they are slow and noisy. The point is that a category of incident will only ever be detectable through them, and the absence of an alert path from support back into engineering is a gap.
What this kind of strategy tends to look like
Reading the layers back, the strategy is shaped by where the questions come from and how quickly they have to be answered. The two top layers, user experience and business signals, are the ones a non-engineer can read at a glance, and they are the ones that catch user-affecting incidents fastest. The three middle layers (frontend/backend triage, tracing, service-level health) are where the bulk of investigation time is spent, and they reward the engineering effort spent on them more directly than the ones above or below. Infrastructure is the layer least likely to be under-built and the easiest to over-rely on, and is rarely the place where the cause actually lives. User feedback closes the loop on everything the other layers do not see.
The strategy does not have to be built top-down even though it should be read top-down. Most teams have a working infrastructure dashboard before they have a single synthetic check, and that is fine; the order things get instrumented is usually the order they were on fire last. What is worth resisting is the conclusion that, because the infrastructure layer is the one that has always been watched, it is the layer the incident is most likely in. The shape of the strategy — what each layer is supposed to answer and where the answers usually live — is what makes the difference between a team that traces incidents up the stack and a team that stares at CPU graphs hoping the wiggle will start meaning something.
Part 2 of this series gets into the cost of building those middle layers — what it takes in engineering time, what changes about how features ship, and why an observability strategy ends up being a development-process question more than a tooling question. Part 3 is about the on-call side — the practices that turn an instrumented system into one the on-call rotation can run without escalating most of the work back to the developers who built it.