The bug that made our alerts lie for months
A severity-1 observability blind spot: a log-shipping prefix broke level extraction, so "no errors" really meant "we stopped being able to see errors." Why a zero is a question, not an answer.
Series: Production war stories. Date: 2026-02-24
The error alerts on a platform I help run had been quiet for a long time. Green dashboards, no pages, no "error rate spiking" messages in Slack. If you'd asked me, I'd have said things were healthy.
They were not healthy. They were silent, which is a different and much worse thing.
I only noticed because I went looking for a specific known error. The application had definitely logged errors that week — I'd seen one in a user report and reproduced it. So I opened the logs in Grafana, filtered to level="error", and got… nothing. Not "a few." Zero. Across the entire API log stream, for as far back as I scrolled.
That's the moment the floor drops out. A query that returns zero errors doesn't mean there are zero errors. It means either there really are none, or your ability to see them is broken. And if the ability to see errors is broken, then every alert built on top of "error rate" has been evaluating to false this whole time. The quiet wasn't health. It was a blind spot the size of the whole service.
Finding it
I stopped trusting the dashboard and went one layer down, querying the Loki API directly instead of through the Grafana UI. Same result: error-level queries came back empty, even though I could see the raw log lines containing the word "error" right there in the unfiltered stream. So the logs were arriving. They just weren't being labeled with a level that the queries could filter on.
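A minimal sketch of that kind of cross-check, in Python against Loki's `query_range` HTTP endpoint. The base URL, the `job="api"` selector, and the assumption that `level` is a stream label are placeholders for illustration, not the real setup. The idea is to count lines that are labeled as errors and lines that merely contain the text "error", then compare the two.

```python
import json
import time
import urllib.parse
import urllib.request

LOKI_URL = "http://localhost:3100"  # assumption: where Loki is reachable


def build_query_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


def count_lines(response_body):
    """Count log lines in a decoded query_range response of type 'streams'."""
    return sum(len(stream["values"]) for stream in response_body["data"]["result"])


def fetch_count(logql, hours=1):
    """Run one query over the last `hours` hours and count returned lines."""
    end_ns = time.time_ns()
    start_ns = end_ns - hours * 3600 * 10**9
    url = build_query_url(LOKI_URL, logql, start_ns, end_ns)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return count_lines(json.load(resp))


def main():
    # Hypothetical selectors: one relies on the level label, one on raw text.
    labeled = fetch_count('{job="api", level="error"}')
    textual = fetch_count('{job="api"} |= "error"')
    print(f"labeled as error: {labeled}, containing 'error': {textual}")
    if labeled == 0 and textual > 0:
        print("Lines mention errors but none are labeled: check level extraction.")
```

Calling `main()` needs a reachable Loki. The counts are capped by `limit`, which is fine here because the only question is whether one number is zero while the other is not.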
That narrowed it to the ingestion pipeline — the stage that reads each log line and extracts a level so you can later query and alert on it. The lines were coming through the process manager (PM2), and PM2 was prefixing each line with its own bookkeeping before it reached the log shipper. That prefix was enough to break the level-extraction step: the parser expected a clean line and got a wrapped one, so it failed to pull a real level off almost every entry. Everything ended up effectively unlabeled. level="error" matched nothing because nothing was labeled error anymore — not because nothing was an error.
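To make the failure mode concrete, here is a small Python illustration. It assumes JSON-formatted log lines and a timestamp-style prefix; both are stand-ins, since the actual line format and the exact PM2 prefix may look different. The parser is not the shipper's real extraction stage, just the same kind of "expects a clean line" logic.

```python
import json


def extract_level_strict(line):
    """Naive extraction: assumes the whole line is a single JSON object."""
    try:
        return json.loads(line).get("level", "unknown")
    except (json.JSONDecodeError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return "unknown"


clean = '{"level": "error", "msg": "payment failed"}'
# Hypothetical wrapper prefix added before the line reaches the shipper.
wrapped = "2026-02-17T10:00:00: " + clean

print(extract_level_strict(clean))    # error
print(extract_level_strict(wrapped))  # unknown
```

The wrapped line still contains the word "error", so a text search finds it, but the extracted level is gone. That is the gap between the unfiltered stream and the level filter.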
The fix itself was small: correct the shipping pipeline so it strips the prefix and parses the actual line, restoring real levels. Within minutes, level="error" lit up with the backlog it should have been showing all along, and level-based alerting started working again.
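The real fix lived in the shipping pipeline's configuration, which isn't reproduced here. The same idea in Python, as a sketch: strip the wrapper first, then parse what's left. The prefix pattern and the JSON line format are assumptions for illustration.

```python
import json
import re

# Hypothetical wrapper format: "<ISO-like timestamp>: <original line>".
PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}: ")


def extract_level(line):
    """Strip a leading wrapper prefix if present, then read the level field."""
    payload = PREFIX.sub("", line, count=1)
    try:
        record = json.loads(payload)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(record, dict):
        return "unknown"
    return str(record.get("level", "unknown")).lower()


wrapped = '2026-02-17T10:00:00: {"level": "error", "msg": "payment failed"}'
print(extract_level(wrapped))  # error
```

Returning an explicit "unknown" rather than dropping the line is a deliberate choice: a sudden rise in unknown-level lines is itself a signal that extraction has broken.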
The part worth keeping
The fix was ten minutes. The lesson was the whole point.
"No alerts" and "no problems" are not the same signal, and your monitoring usually can't tell you which situation you're in. A healthy service and a broken alerting pipeline produce the same observable: silence. The only way to tell them apart is to deliberately go check that your alerting can still fire.
A few things I do now:
- Treat a zero as a question, not an answer. "Zero errors this week" should make you slightly suspicious, not relaxed — especially on a busy service. Cross-check against a signal that comes from a different path (a known reproduced error, request counts, downstream effects).
- Test the alert path end to end, on a schedule. Emit a synthetic error on purpose and confirm it shows up labeled correctly and trips the alert. If you only ever test the code, you're taking it on faith that the entire logs-to-labels-to-alert chain still works.
- Watch the seams between tools. This bug didn't live in the app or in the log store. It lived in the boring handoff between the process manager and the log shipper — exactly the kind of place no one owns and no test covers.
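The first two habits can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: `emit` and `query_error_lines` stand for whatever writes a log line through your normal path and whatever reads back lines your log store labels as errors, and the `min_requests` threshold is an arbitrary example value.

```python
import uuid


def zero_is_suspicious(error_count, request_count, min_requests=1000):
    """Flag a zero-error reading on a service busy enough to expect some errors."""
    return error_count == 0 and request_count >= min_requests


def canary_round_trip(emit, query_error_lines):
    """Emit a uniquely tagged synthetic error and check it comes back as an error.

    emit(message): writes one error-level line through the normal logging path.
    query_error_lines(): returns the lines currently labeled as errors.
    """
    marker = f"synthetic-canary-{uuid.uuid4()}"
    emit(f"{marker} deliberate test error")
    return any(marker in line for line in query_error_lines())


# Usage with in-memory fakes standing in for the real pipeline.
store = []
print(canary_round_trip(store.append, lambda: store))     # True
print(canary_round_trip(lambda msg: None, lambda: []))    # False
```

Against a real pipeline the query would need to wait for ingestion before reading back, and this sketch only covers the path up to labeling. Confirming that the alert itself fires is a further step.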
The scariest outages aren't the loud ones. The loud ones page you. The scary one is the service quietly telling you everything's fine, in a voice that stopped working months ago.