When people say "we have observability" they usually mean "we have a tool." That sentence is a category mistake, and it costs more than it looks like it costs. The tool can only show what your code chooses to emit. Buying the tool without writing the code is buying a microscope and pointing it at a blank slide.
This is Part 2 of a three-part series on building an observability strategy. Part 1 walked through the layered structure of what to monitor and why: the checklist of questions a working program is supposed to answer. Part 3 covers the operational side once the strategy is in place, namely the practices that reduce escalations from on-call rotations to the developers who built the system. This part is the middle piece: what it costs to write the code that produces those answers, why that cost is bigger than most teams plan for, and where in the development process the time has to come from.
The single most useful number for getting the scale of this right is one a careful reader can verify themselves. The OpenTelemetry Demo is a reference application maintained by the OpenTelemetry community, written across Go, Java, .NET, C++, Python, Node, and a handful of other languages so that each service shows what proper instrumentation looks like in that runtime. In it, somewhere between 30% and 45% of the lines of code in each service are observability instrumentation. Not feature code. Instrumentation: the spans that get started and ended, the metrics that get recorded, the structured log statements, the context propagation, the resource attributes.
Pause on that. Roughly a third of the codebase in a system that the OpenTelemetry maintainers consider properly instrumented is observability. It is not 5%. It is not a sprint at the end. It is closer to the share of code that handles error paths than to the share that handles a single feature.
What the 30–45% number captures is the application-side instrumentation: structured logs, traces, metrics, context propagation across service boundaries, resource attributes. What it does not capture is the rest of the program — the collector configuration that routes those signals where they need to go, the dashboards built on top of them, the alert rules that turn the metrics into pages. Those live elsewhere in the repository and add to the total. The 30–45% is a floor.
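To make the share visible at small scale, here is a minimal sketch of one instrumented function. It does not use the OpenTelemetry API: `span`, `SPANS`, and `REQUESTS` are hypothetical standard-library stand-ins for a tracer, a trace exporter, and a metrics counter, and `handle_checkout` is an invented feature. Context propagation and resource attributes are left out entirely, so a real service would carry more instrumentation than this, not less.

```python
import json
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("checkout")
REQUESTS = Counter()  # hypothetical stand-in for a metrics counter
SPANS = []            # hypothetical stand-in for a trace exporter


@contextmanager
def span(name, **attributes):
    """Record a named, timed operation; a stand-in for a tracing span."""
    start = time.perf_counter()
    record = {"name": name, "attributes": attributes, "status": "ok"}
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)


def price_order(items):
    """Feature logic: total in cents for (unit_price_cents, quantity) pairs."""
    return sum(price * quantity for price, quantity in items)


def handle_checkout(order_id, items):
    # Lines tagged "instrumentation" exist only to make the feature observable.
    with span("checkout.price_order", item_count=len(items)) as current:  # instrumentation
        total = price_order(items)                                        # feature
        current["attributes"]["total_cents"] = total                      # instrumentation
    REQUESTS[("checkout", "ok")] += 1                                     # instrumentation
    logger.info(json.dumps({                                              # instrumentation
        "event": "checkout.priced",
        "order_id": order_id,
        "total_cents": total,
    }))
    return total                                                          # feature
```

Inside `handle_checkout`, two lines do feature work and the rest exist so that somebody can later see what happened. The exact ratio in this toy is not meaningful; the shape is.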
What that number means for how a team works
If instrumentation is a third of the code, it deserves the same engineering disciplines as the other two-thirds. Most teams treat it as something one developer adds quickly before a feature ships. That works for the feature shipping in isolation; it fails as soon as another developer needs to follow the same conventions on the next feature, or three developers do, or the team grows past the point where any one person can hold the conventions in their head.
The set of practices that change once instrumentation crosses the threshold of "this is real engineering" is small but specific. Standards are the first one. A team that has a documented convention for service architecture and API contracts almost always has nothing equivalent for metric naming, label cardinality, span attribute keys, or log structure. Without those, every service ends up with its own idioms, the dashboards become a museum of inconsistent terminology, and a new engineer has to learn the conventions of five services to read across them. The standards do not have to be elaborate; they have to exist and be enforced in code review.
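A standard is easier to enforce when part of it is executable. The sketch below checks a metric definition against a made-up convention: snake_case names that end in a unit or `total` suffix, and labels drawn from a shared vocabulary. The pattern, the suffix list, and `ALLOWED_LABELS` are all assumptions for illustration; a team would substitute whatever it actually agreed on.

```python
import re

# Hypothetical convention: lowercase snake_case ending in a unit or "total" suffix.
METRIC_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_(seconds|bytes|total|ratio)$")

# Hypothetical shared label vocabulary for the whole team.
ALLOWED_LABELS = {"service", "route", "method", "status_class"}


def check_metric(name, labels):
    """Return a list of convention violations for one metric definition."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append(f"{name}: name must be snake_case and end in a unit suffix")
    unknown = sorted(set(labels) - ALLOWED_LABELS)
    if unknown:
        problems.append(f"{name}: labels not in the shared vocabulary: {unknown}")
    return problems
```

Run over every metric a service declares, a check like this turns the naming half of the review into a CI failure and leaves the reviewer free for the judgment calls.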
Code review is the second. Reviewers who know what good feature code looks like often have no opinion on instrumentation: they wave through a metric with no labels, or a label whose cardinality is unbounded, or a span boundary placed where it cuts a meaningful operation in half, because none of those make the feature fail in tests. The reviewer who knows that an unbounded label can blow up the metrics backend on the next traffic spike, or that a span boundary on a goroutine handoff hides the actual latency, is the one who catches them. Building that reviewer takes the same kind of practice as building a reviewer who is good at concurrency bugs, and the same kind of pairing.
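The unbounded-label problem in particular has a shape a reviewer can learn to ask for. One common remedy, sketched here under assumptions, is to normalize raw values into a small fixed set before they ever become a label. `status_class` and `BoundedLabel` are hypothetical helpers, not part of any metrics library.

```python
def status_class(status_code):
    """Collapse an HTTP status code into one of a small fixed set of label values."""
    if not 100 <= status_code <= 599:
        return "unknown"
    return f"{status_code // 100}xx"


class BoundedLabel:
    """Cap the number of distinct values one label may take.

    The first `max_values` distinct values pass through unchanged; anything
    new after that is reported under a single overflow value.
    """

    def __init__(self, max_values, overflow="other"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen = set()

    def normalize(self, value):
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow
```

A cap like this trades detail for safety: the backend survives a traffic spike, at the cost of lumping late-arriving values together. Whether that trade is right for a given label is exactly the kind of question the review should raise.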
QA on the data itself is the practice that is most often missing entirely. It is one thing to verify that a feature does what the spec says; it is another to verify that the metric the feature emits arrives at the dashboard with the right value, with the right labels, at the right granularity. The shape of that QA is different from feature QA: less "does the test pass" and more "open the dashboard, exercise the path, look at the panels." Teams that do not budget time for this kind of check ship instrumentation that compiles and runs and is wrong, and find out about it during the incident the instrumentation was supposed to help with.
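Looking at the dashboard cannot be fully automated, but the step before it can: asserting that exercising a path emits the metric with the expected name, labels, and value. A minimal sketch follows, with `InMemoryMetrics` as a hypothetical test double and `apply_discount` as an invented feature. It complements the manual check rather than replacing it.

```python
from collections import defaultdict


class InMemoryMetrics:
    """Hypothetical test double that records counter increments by name and labels."""

    def __init__(self):
        self.counters = defaultdict(int)

    def increment(self, name, **labels):
        self.counters[(name, tuple(sorted(labels.items())))] += 1

    def value(self, name, **labels):
        # .get avoids creating an empty series as a side effect of reading.
        return self.counters.get((name, tuple(sorted(labels.items()))), 0)


def apply_discount(metrics, total_cents, code):
    """Feature under test: 10% off for a known code, one metric per outcome."""
    valid = code == "WELCOME10"
    outcome = "applied" if valid else "rejected"
    metrics.increment("discount_applications_total", outcome=outcome)
    return total_cents * 90 // 100 if valid else total_cents


def test_discount_metric_has_expected_labels():
    metrics = InMemoryMetrics()
    assert apply_discount(metrics, 1000, "WELCOME10") == 900
    assert apply_discount(metrics, 1000, "bogus") == 1000
    assert metrics.value("discount_applications_total", outcome="applied") == 1
    assert metrics.value("discount_applications_total", outcome="rejected") == 1
    # Exactly two series should exist: no stray label combinations.
    assert len(metrics.counters) == 2
```

The last assertion is the one feature tests rarely have: it fails if the code quietly starts emitting a label combination nobody planned for.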
Security review catches a failure mode that sits behind a long list of compliance fines: sensitive data in logs. The path from "let's log the request payload to help debug" to "we logged credit card numbers for six months" is short and well-trodden, and it runs through names, email addresses, payment tokens, session identifiers, and anything else covered by a regulation. The review has to happen on the instrumentation diff, not on the feature diff, because the logging call is where the data escapes.
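A review is the main defense; a redaction filter is a reasonable backstop. The sketch below uses the standard library's `logging.Filter` to mask structured fields by key and to blank out long digit runs in the message text. `SENSITIVE_KEYS`, the `fields` attribute, and the digit-run pattern are assumptions for illustration, and the pattern is deliberately crude: it will miss formatted card numbers and will also mask harmless long numbers.

```python
import logging
import re

# Hypothetical list; a real one comes out of the security review.
SENSITIVE_KEYS = {"email", "card_number", "session_id", "payment_token"}

# Crude heuristic: any unbroken run of 13 to 19 digits is treated as card-like.
CARD_LIKE = re.compile(r"\b\d{13,19}\b")


class RedactingFilter(logging.Filter):
    """Mask sensitive structured fields and card-like digit runs before output."""

    def filter(self, record):
        # Structured fields are assumed to arrive via extra={"fields": {...}}.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            record.fields = {
                key: "[REDACTED]" if key in SENSITIVE_KEYS else value
                for key, value in fields.items()
            }
        # Only the message template is scrubbed here, not its format arguments.
        if isinstance(record.msg, str):
            record.msg = CARD_LIKE.sub("[REDACTED]", record.msg)
        return True  # never drop the record, only rewrite it
```

Attached with `logger.addFilter(RedactingFilter())`, this catches the careless case. It does not catch a payload serialized into a format argument, which is why the diff still needs a human reading it.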
Operations collaboration is the last. The instrumentation has to support the alerting and runbook practices of the operations team, which means deciding together what metrics are critical, what thresholds make sense, and what context an engineer paged at 2 a.m. needs to find the issue without further help from the developer who built the feature. The alternative — defining alerts after the instrumentation ships, by an operations team reading code they did not write — produces alerts that fire on the wrong condition or fire on the right condition with no useful context.
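One way to make that joint decision concrete is to write each alert down as data and check it against what the metric actually emits. This is a sketch under assumptions: `AlertRule`, its fields, and `review_alert` are hypothetical and do not correspond to any particular alerting tool's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One alert, as agreed between the developers and the operations team."""
    name: str
    metric: str
    threshold: float
    runbook_url: str = ""
    owner: str = ""
    context_labels: tuple = ()  # labels the page must carry, e.g. ("service", "route")


def review_alert(rule, emitted_labels):
    """Return reasons this alert would leave a paged engineer without context."""
    problems = []
    if not rule.runbook_url:
        problems.append(f"{rule.name}: no runbook linked")
    if not rule.owner:
        problems.append(f"{rule.name}: no owning team")
    missing = [label for label in rule.context_labels if label not in emitted_labels]
    if missing:
        problems.append(f"{rule.name}: metric does not emit labels {missing}")
    return problems
```

The third check is the collaborative one: it fails when operations asks for context in the page that the instrumentation never produced, which is far cheaper to learn in review than at 2 a.m.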
What this costs in development time
The codebase share does not map directly to time share — instrumentation is faster per line of code than the feature it instruments, because the patterns repeat and the libraries do the heavy lifting. But the time cost is still substantial. A representative example, sketched on a project shaped like the OpenTelemetry Demo, is a useful anchor.
That example comes out to a 28% increase in development time, the kind of number that sounds either modest or alarming depending on what you compare it to. Compared to the cost of building the feature, it is modest. Compared to the line in a project plan that says "instrumentation, EOD Friday," it is alarming. The number is honest when the comparison is to feature work delivered with the discipline that lets it be supported in production for years, with the same QA, the same review, and the same security and operations involvement. It is dishonest when the comparison is to feature work that ships fast and accrues operational debt that someone else pays later.
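The gap between code share and time share can be shown with a rough calculation. This is a back-of-the-envelope sketch, not the derivation of the 28% figure: it assumes every instrumentation line is written some constant factor faster than a feature line, and it ignores review, data QA, and standards work entirely. `speed_ratio` is a guess the reader supplies, not a measured value.

```python
def time_overhead(instrumentation_share, speed_ratio):
    """Extra development time from instrumentation, as a fraction of feature time.

    instrumentation_share: fraction of all lines that are instrumentation,
        in the range [0, 1).
    speed_ratio: assumed number of instrumentation lines written in the time
        one feature line takes (an illustrative assumption).
    """
    if not 0 <= instrumentation_share < 1:
        raise ValueError("instrumentation_share must be in [0, 1)")
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    feature_share = 1 - instrumentation_share
    # Instrumentation lines per feature line, discounted by how fast they are written.
    return instrumentation_share / (feature_share * speed_ratio)
```

With a speed ratio of 1, a 30% code share already means about 43% more time than the feature alone, because the share is measured against the whole codebase while the overhead is measured against the feature. It takes a speed ratio well above 1 to bring the time overhead back under the code share.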
The case for spending the 28% is not abstract. It is faster incident detection, which translates to shorter incidents. It is the ability to answer "why is this slow" in minutes from a dashboard instead of in days from a postmortem. It is fewer 2 a.m. pages that the on-call engineer cannot diagnose without paging the developer who wrote the code. Those benefits compound over a system's lifetime; the upfront cost does not. The math reliably comes out in favor of the spend once a system has run in production long enough to see its second incident.
How to think about this from each seat
For a business leader: the instrumentation line in the engineering plan somebody hands you is one that pays for itself, and the time horizon over which it does so is shorter than that of most capital line items on the same plan. The frame that fails is treating the line as an overhead bucket and trimming it under deadline pressure. The systems that get trimmed there are the ones that show up in the postmortem of the next outage as the reason it took longer to find the cause than to fix it.
For an engineering manager: the work has to be scheduled, not absorbed. Allocate the 28% explicitly in the project plan rather than expecting it to fit into the cracks between feature work. The standards work — naming, labeling, structured logging conventions — pays off on the second project that uses them, so the time spent setting them up early is recovered before the first year is out. The QA discipline for the data is the part that is hardest to introduce mid-stream; if you are starting fresh, build it into the sprint shape from the beginning.
For a developer: the change in practice is small but specific. Instrument as you write the feature, not after it. Use the team's conventions; if they do not exist yet, start the conversation that produces them. Check the dashboard for your own work before calling it done — if the metric you added does not show up, or shows up wrong, find out now, not when somebody else is reading it during an incident. The habit of treating the instrumentation diff as part of the feature diff, with the same standards of review and test, is the one that distinguishes systems that are easy to operate from ones that are not.
Part 3 of this series turns from the development side to the operations side: how an observability strategy shaped this way changes the on-call experience for developers, and why a well-instrumented system reduces the escalation rate from on-call to engineering rather than increasing it.