The Quiet Failure Review / Issue 14 · Vol. III / 12 June 2026
Press run holding · 4,820 copies / est. read 22 min
Field Notebook ❦ Operations · Postmortem ¶ Filed under: silent failures, drift, on-call
Automations that quietly broke.
Nine production scripts, two years of green dashboards, and the exact moment each one stopped doing what we thought it did.
By Marit Halversen, with field notes from the on-call rotation at Hauk & Berge.
Senior staff, Platform — written 04 → 21 May 2026, Trondheim & Lisbon
Issue 14 · Filed 21 May 2026, 03:11 CEST · Edited by P. Knaak · Reviewed by SRE Guild, Cohort 9 · Status: Published, second printing
9 cases · automations cataloged
23 months · observation window
412 runs · silently wrong outputs
0 alerts · fired during the window
11 smells · in the detection list
€87k · est. cost of corrections
01 — Editor's letter
A letter from the on-call desk, written in the small hours
Marit Halversen · Trondheim, 04 May 2026 · 02:48
The first one we found by accident. Olav was rebuilding the weekly revenue dashboard in March 2024, because the old one had a date column that quietly switched from Europe/Oslo to UTC the previous autumn and nobody had complained. He ran the new query alongside the old one. The numbers disagreed by 4.7 percent. Then, because Olav is the sort of engineer who pulls a thread until something comes loose, he checked both against the source system. All three sets of numbers disagreed.
None of the dashboards were broken. The automation that fed them was broken. A scheduled job had been running on a stale view since November 2023 — a view that itself depended on a materialization that had silently stopped refreshing because a downstream Snowflake task had been paused during a migration and never resumed. Argo Workflows reported success every morning at 06:15. Datadog showed green. Slack received its daily "ETL completed in 4m11s" message at 06:19 like a clock. The job was doing exactly what it was written to do. It was just doing it on yesterday's yesterday.
What is in this issue is a catalog of nine such cases that we, at Hauk & Berge, found between February 2024 and April 2026. I have asked three other teams to contribute: the field reports in §05 come from Caia (a midsize logistics outfit in Porto), from a developer-tools company in Berlin, and from the SRE guild at a Norwegian regional bank we are not allowed to name. The cases are diverse: a Slack bot that addressed the wrong channel for sixteen months; a feature-flag rollback that silently re-rolled the next morning; a backup job that wrote zero-byte tarballs because of a quoting issue and reported success because tar exited cleanly. None of them paged. All of them were technically successful.
The thesis is small and uncomfortable: success metrics are the wrong invariant. A green check answers "did the job finish?" and not "did the job do the thing?". Most production automations are wired only to the first question. That is the gap we are documenting, and the smell list at the end of this issue (§06) is our attempt — in the spirit of Cantrill's "things you wish you knew" talks — to make the gap easier to spot before it costs anyone €87,000 and a Saturday.
I want to thank the on-call rotation for letting me publish their incident notes verbatim, and the editorial committee of the Quiet Failure Review for tolerating a fourteenth consecutive issue about things that "looked fine."
02 — The catalog
Nine automations that succeeded, and what they were actually doing
A summary table · sortable in the print edition, not here
Case index · ID, system, what monitoring saw, what the system actually did
| No. | System | First seen | Role | What monitoring showed | Runs wrong | Days silent |
|---|---|---|---|---|---|---|
| 01 | etl-daily-revenue · Snowflake task | 2024-03-19 | drift | Argo: ok · 4m11s · 06:19 ping | 138 | 128 |
| 02 | nightly-tarball · backup-host-03 | 2024-05-02 | silent | cron exit 0 · 22.4 GB reported | 14 | 14 |
| 03 | slackbot-alerts-on-call | 2024-08-11 | ghost | 200 OK from Slack API | 21 | 487 |
| 04 | flag-rollback-canary · LaunchDarkly | 2024-10-30 | drift | Targeted rule reverted · audit ok | 6 | 12 |
| 05 | renew-cert-le · acme-renewer v3 | 2024-12-04 | silent | "3 certs renewed" · 0 actually new | 9 | 61 |
| 06 | pagerduty-rotation-shifter | 2025-01-18 | ghost | Rotation rotated (just to itself) | 7 | 49 |
| 07 | airflow-dag · invoice-reissue | 2025-06-22 | drift | DAG green · 1,204 PDFs generated | 1,204 | 1 |
| 08 | scheduler-double-run · k8s CronJob | 2025-09-09 | silent | Both runs reported success, 4 min apart | 216 | 73 |
| 09 | gh-actions · release-publish | 2026-02-14 | drift | Tag pushed · artifacts uploaded | 3 | 26 |
A note on the role column. silent means the job stopped doing the work but reported success.
drift means the job did different work than its name suggests, gradually, because something upstream moved.
ghost means the job ran, addressed the wrong target, and produced output that nobody read.
03 — Three cases, opened up
A tarball with no bytes, a slackbot in the wrong room, and 1,204 invoices to nobody
Field detail · case 02, case 03, case 07
CASE 02 · silent · backup-host-03
The nightly tarball that contained nothing
On 18 April 2024 we changed the backup root from /srv/data/prod to /srv/data/prod_v2. The script used a here-doc to build the path; a stray space made the resulting argument resolve to /srv/data, which is empty on backup-host-03.
tar exited 0. The cron mail said "22.4 GB written" because that was the previous evening's number, hard-coded into the wrapper for nostalgia. We discovered it on day fourteen when restoring for a customer.
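A verifier for this case does not need to be clever; it needs to read the archive itself rather than trust the wrapper's message. The sketch below is one possible shape, under stated assumptions: the function names are hypothetical, the tarball is readable from local disk, and the previous night's file count is stored somewhere the backup wrapper cannot write to. The ±5% tolerance mirrors the figure suggested in §07.

```python
import tarfile
from pathlib import Path


def count_regular_files(tarball: Path) -> int:
    """Count regular-file members by opening and iterating the archive."""
    with tarfile.open(tarball, "r:*") as archive:
        return sum(1 for member in archive if member.isfile())


def within_tolerance(current: int, previous: int, tolerance: float = 0.05) -> bool:
    """True when `current` is non-zero and within ±tolerance of `previous`."""
    if current == 0 or previous == 0:
        # An empty archive is never acceptable, whatever the exit code said.
        return False
    return abs(current - previous) <= tolerance * previous


def verify_backup(tarball: Path, previous_count: int) -> bool:
    """Hypothetical entry point: compare tonight's archive to last night's count."""
    return within_tolerance(count_regular_files(tarball), previous_count)
```

Run against an archive of an empty directory, `count_regular_files` returns 0 and `verify_backup` returns False, which is the signal the cron mail never gave us.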
CASE 03 · ghost · slackbot-alerts-on-call
The bot that addressed the wrong channel for 487 days
A channel rename in August 2023 (#oncall-payments → #payments-oncall) silently re-pointed our Slack app's webhook to a private channel with three former employees. Slack returned 200 OK because the channel still existed for the bot.
The bot sent 21 high-severity alerts to a room with no live readers. We found it because a former colleague, Anneli, replied "is anyone seeing this?" after she logged in on parental leave.
Audit trail · 21 unread P1s · 4 of them were latent payment failures
CASE 07 · drift · invoice-reissue
The DAG that reissued 1,204 paid invoices
A new column status_at was added to invoices on 21 June 2025. The reissue DAG joined on status and filtered on updated_at < now() - interval '7 days'. The migration backfilled updated_at on every row to the migration timestamp.
The next morning the DAG identified 1,204 invoices as "stale and unpaid" and reissued every one of them to customers who had paid. Two of them paid again.
The job did not break. The world around it broke in a way the job could not see, and the job kept doing what it had been told.
From the post-incident notes of CASE 07, written by P. Knaak, 24 June 2025
A green check is not an outcome. It is an opinion about an outcome, written by the same program that produced it.
Marit Halversen · §03 marginalia · Trondheim, May 2026
04 — Anatomy of a silent doubling
A scheduler ran twice for 73 days and nobody noticed
Case 08 · k8s CronJob · 09 Sept 2025 → 21 Nov 2025 · 216 duplicate runs
T-0 · 09 Sep 2025 14:11 CEST
Migration to a new namespace
The reporting CronJob was copied from cluster-a/legacy into cluster-a/reporting-v2 as part of the namespace cleanup. The legacy CronJob was meant to be paused. kubectl patch failed silently against the old CRD; the operator running the command typed y at a prompt that, in fact, was for a different terminal pane.
Observed: Apply log says: "cronjob.batch/reporting patched". The patch was a no-op. Nobody re-checked schedule fields.
T+1 · 10 Sep 2025 03:00–03:04
Both schedulers fire
At 03:00 the new CronJob ran. At 03:04 the old one ran. Both completed in roughly two minutes. Both wrote to the same warehouse table — reporting.daily_partner_totals — using an UPSERT keyed on partner_id + date. The second write looked like an idempotent re-run and was therefore silently absorbed.
Observed: Two Argo events; two Slack pings ("daily reporting ok"); the channel name is #etl-receipts and engineers ignore it on principle.
T+12 · 22 Sep 2025
A subtle figure-of-eight in the numbers
For 24 partners that briefly went above a tier threshold between 03:00 and 03:04, the second write changed the tier from "Bronze (boundary)" to "Bronze (final)" because of a row-locked re-computation. The dashboards averaged the two writes; the number drifted by 0.3–1.1% per partner. Nobody noticed because it was within the weekly seasonal noise band.
Observed: Drift visible in retrospect on the per-partner anomaly plot, but only with two-decimal precision, which the team had downsampled in February to reduce dashboard clutter.
T+48 · 27 Oct 2025
A partner complains, very politely
Henning at Vinmonopolet's analytics team writes: "we're showing 4,212 transactions where you have 4,219. It is not a big deal but we would like to understand the seven." We thank him and start looking. We do not yet realize the figure-of-eight; we look at the API for two days first.
Observed: Customer-side audit found the issue. Our pipeline did not. This is the third time in 23 months that "customer caught it first" appears in our incident log.
T+73 · 21 Nov 2025 11:42 CET
The double-run is found
P. Knaak runs kubectl get cronjobs -A while auditing an unrelated namespace and notices two CronJobs with identical schedules in different namespaces. Total: 216 duplicate runs across 73 days. 14 of those days overlapped a partner-tier boundary. We rolled back the figure-of-eight by re-running the entire window with a deterministic single writer.
Observed: The thing that found it was a human running get -A across namespaces. That command is now in our daily morning runbook, line 1.
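The human check can be made mechanical. The following is a minimal sketch, not our actual runbook step: it assumes the input is the JSON printed by kubectl get cronjobs -A -o json, and it assumes that a copied CronJob keeps its original name, which was true in this incident but may not be in yours. The function name is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def find_duplicate_cronjobs(kubectl_json: str) -> Dict[str, List[str]]:
    """Return CronJob names that are active in more than one namespace.

    Keys are CronJob names; values are the sorted namespaces where an
    unsuspended CronJob of that name exists.
    """
    items = json.loads(kubectl_json).get("items", [])
    namespaces_by_name: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        if item.get("spec", {}).get("suspend", False):
            continue  # a suspended CronJob will not fire, so it is not a duplicate
        meta = item["metadata"]
        namespaces_by_name[meta["name"]].append(meta["namespace"])
    return {
        name: sorted(namespaces)
        for name, namespaces in namespaces_by_name.items()
        if len(namespaces) > 1
    }
```

A non-empty result is not proof of a double run, only a prompt to look. Grouping by schedule instead of name would cast a wider net, at the price of many false positives for jobs that merely share a start time.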
Two schedulers running, both succeeding, both writing the same row, both confirming each other's lies — a more efficient form of being wrong is hard to imagine.
Incident review 08-RX · authored 22 Nov 2025 · approved by Platform & Finance
05 — Field reports
From Porto, from Berlin, and from a Norwegian bank we cannot name
Three short contributions · solicited via the QFR mailing list, March 2026
The Caia report is the one that makes us most uncomfortable, because it could happen to anyone — the renewer printed "renewed: ${domain}" as the last line of a shell function regardless of whether the inner acme-tiny call had returned a non-zero exit code. A retry loop earlier in the script swallowed errors. The certs began to expire one by one in late January; the operations team caught the first one at 04:17 on 30 January 2025, when a customer in Belém reported a TLS error to their support line. Nine more would have expired in the following 11 days.
The Berlin case has a different shape. Their nightly "rollback validator" was supposed to confirm that the targeting rule on a feature flag matched the rule that had been rolled out the previous evening. The validator hashed the rule before the variation was attached, and compared that hash to a stored hash from the previous evening — which had also been computed before the variation was attached. The two hashes always matched. The variation, in fact, had been silently overwritten by a separate scheduled "warmup" job that re-applied the previous week's targeting at 02:14 every morning. It took the team six rollouts to discover this: the flag-state on production was, every morning, one week stale, exactly six days out of date by Monday lunch.
The bank's case is the simplest and the most absurd. A PagerDuty rotation shifter, written in 2019 by a contractor whose name has been scrubbed from the repo, rotated a primary on-call across three rotations. By 2024 two of the rotations had been deleted (the API returned 404, which the script interpreted as "no shift needed") and the third had been reduced to one person, Solveig, who was therefore "rotated" onto herself every Monday. The script ran successfully. The audit log of PagerDuty showed seven weeks of "schedule updated" events that did not, in any meaningful sense, update anything. Solveig was on-call continuously from 11 November to 30 December 2024. She mentioned it in a 1:1, gently.
What the three cases share is not a code defect — the code did precisely what it was written to do. What they share is a missing invariant. None of the systems had a way to ask "did the world look different after I ran than before I ran, in the way I intended?" That question is harder to write than "did I exit 0," but it is the only question that detects this class of failure.
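That question can be written down as a small wrapper: observe, act, observe again, and compare. The sketch below is an illustration only; `run_with_effect_check`, `NoEffectError` and the three callables are hypothetical names, and the hard part (deciding what `intended` should assert) is left to the caller.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class NoEffectError(RuntimeError):
    """The action completed, but the observed state is not the intended one."""


def run_with_effect_check(
    observe: Callable[[], T],
    act: Callable[[], None],
    intended: Callable[[T, T], bool],
) -> T:
    """Observe the world, act, observe again, and fail if the change is wrong."""
    before = observe()
    act()
    after = observe()
    if not intended(before, after):
        raise NoEffectError(f"state went from {before!r} to {after!r}")
    return after
```

For the bank's rotation shifter, `observe` would return the current primary on-call, `act` would perform the shift, and `intended` would be `lambda before, after: after != before`. With a rotation of one person the wrapper raises on the first Monday instead of the seventh.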
06 — A field smell list
Eleven signs your automation is quietly lying to you
Collected from the nine cases above · ordered by frequency of appearance
Smell 01 · 9 of 9
The success signal is produced by the same code path as the work
If the line that says "exit 0" is in the same script as the line that does the thing, your monitor cannot distinguish "I did the thing" from "I reached the end of the file." Move the success signal to a separate verifier that observes the result, not the process.
Smell 02 · 7 of 9
The metric is a count of runs, not a count of effects
"ETL ran 372 times last month" tells you nothing about whether 372 datasets are fresh. If you cannot back the success metric with a count of changed rows, delivered messages, renewed certs — you are counting attempts.
Smell 03 · 6 of 9
Idempotency is asserted but never tested
Every team we surveyed wrote "idempotent" in a docstring. None of them had a regression test that ran the job twice and compared outputs. The double-run in Case 08 was undetectable because both runs were "idempotent" — they overwrote each other cleanly.
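The missing regression test is short. A minimal sketch, assuming the job can be expressed as a function that mutates a dictionary of state (real jobs would need a scratch database or a fixture instead); `is_idempotent` is a hypothetical helper name.

```python
import copy
from typing import Callable


def is_idempotent(job: Callable[[dict], None], initial_state: dict) -> bool:
    """Run `job` once, then a second time on the result, and compare the states.

    The caller's `initial_state` is left untouched; both runs work on copies.
    """
    after_first = copy.deepcopy(initial_state)
    job(after_first)
    after_second = copy.deepcopy(after_first)
    job(after_second)
    return after_first == after_second
```

A job that sets a value passes; a job that increments a counter, or recomputes a field from data that moved between runs, does not. A test of this shape might have flagged the tier re-computation in Case 08, provided the fixture included a partner sitting on a boundary.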
Smell 04 · 6 of 9
The job's last meaningful change was more than 90 days ago
Cases 01, 02, 04, 05, 06, 09 had not been edited for a median of 217 days. The world around them had changed; they hadn't. Tag jobs by "last semantic edit" and review the long tail quarterly.
Smell 05 · 5 of 9
Output is delivered to a place nobody reads
Slack channels with five members; email aliases that bounce silently; a Notion page nobody has opened since 2023. If the output's read receipt is missing, the output is a tree falling in a forest with a microphone.
Smell 06 · 5 of 9
The job's name describes what it used to do
nightly-tarball, renew-cert-le, invoice-reissue — names that are aspirational, not descriptive. Rename to what the job currently produces: writes-tarball-to-s3-cold, requests-cert-and-installs-if-changed.
Smell 07 · 4 of 9
Errors are converted to log lines, not exits
Cases 03 and 05 both had inner failures that were printed but not raised. A grep over print.*error in your scheduled scripts is uncomfortable and instructive.
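For teams that would rather run the search from a script than from a shell, here is a rough equivalent. It is a sketch under assumptions: the file globs and the case-insensitive flag are our additions, not part of the original grep, and the pattern will produce false positives that a human has to read.

```python
import re
from pathlib import Path
from typing import List, Sequence, Tuple

# The pattern from the smell above; IGNORECASE is an added assumption.
SWALLOWED = re.compile(r"print.*error", re.IGNORECASE)


def find_swallowed_errors(
    root: Path, patterns: Sequence[str] = ("*.py", "*.sh")
) -> List[Tuple[Path, int, str]]:
    """Return (path, line number, stripped line) for every line matching SWALLOWED."""
    hits: List[Tuple[Path, int, str]] = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            text = path.read_text(errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if SWALLOWED.search(line):
                    hits.append((path, lineno, line.strip()))
    return hits
```

Each hit is a candidate, not a verdict: the question to ask of every line is whether anything after it changes the exit code.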
Smell 08 · 4 of 9
Two systems own the same schedule
If both Airflow and a k8s CronJob, or both GitHub Actions and a Lambda EventBridge rule, can fire the same logical task — assume both are firing until you have a single audit log proving otherwise. Case 08 is the canonical example.
Smell 09 · 3 of 9
The integration trusts a 200 OK as confirmation
Slack returns 200 even when the channel has zero readers. LaunchDarkly returns 200 even when the targeting hash matches a stale rule. A 200 is "I received your bytes," not "I did the thing."
Smell 10 · 3 of 9
The dashboard for this job has not been opened in 30 days
Grafana, Datadog, and Looker all expose "last viewed." If nobody is looking, nobody is going to notice the slope changing. Dashboards atrophy the same way muscles do.
Smell 11 · 2 of 9
The runbook says "if this fires, contact NAME"
NAME left the company in 2023. The page has 14 views in the last twelve months. Replace "contact NAME" with "open this Looker dashboard, look at the green line, if it is not green do X." Procedure beats personality.
07 — Questions readers keep sending us
A short FAQ, written in response to issue 13's mailbag
Selected from 184 messages received between 04 May and 19 May 2026
You keep saying "verify the effect, not the run." What does that look like in practice?
For Case 02 it would be a second job, running 20 minutes after the tarball job, that downloads the tarball, lists its contents, and compares the file count to the previous night's count ±5%. For Case 05 it would be a separate process that issues a TLS handshake against the renewed domain and reads the notAfter date.
The verifier must not share code with the job. It is uncomfortable. It feels like duplication. Duplication is, in this specific case, the point — you are deliberately denying yourself a shared bug.
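For the Case 05 verifier, the handshake and the date arithmetic fit in a few lines of standard-library Python. This is a sketch, not the verifier we run: the function names and the 60-day threshold are assumptions, and the check presumes the verifier host can reach the domain on port 443.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Perform a TLS handshake and return the served certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def looks_renewed(not_after: str, min_days: float = 60.0, now: Optional[float] = None) -> bool:
    """Hypothetical policy: a freshly renewed cert should have at least `min_days` left."""
    return days_remaining(not_after, now) >= min_days
```

With the default context, an already expired certificate fails verification during the handshake, so `fetch_not_after` raises instead of returning. That is also a useful signal, and the verifier should treat it as a failure, not as an error to log.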
Are you arguing against scheduled jobs?
No. We are arguing against believing scheduled jobs without a second source. Cron is fine. Argo is fine. k8s CronJobs are fine. What is not fine is letting the same artifact be the producer, the success signal, and the audit log.
Wouldn't observability tools (Datadog, Grafana, Honeycomb) have caught these?
In two cases (01 and 07) a sufficiently granular anomaly detector might have noticed the drift earlier. In the other seven, the observability tools were doing exactly what they were configured to do: report on the job's self-declared success. The tool is downstream of the lie.
We do not blame the vendors. They are not, and cannot be, the source of truth about your business outcome.
What about chaos engineering — kill the job, see what breaks?
That catches the case where the job doesn't run. It is excellent for that. It does not catch the case where the job runs and produces the wrong thing, which is the entire subject of this issue. Chaos for outputs is harder; the closest thing we have found that works is what we call shadow re-execution — re-running the previous week's job against a frozen copy of last week's inputs, and asserting that the output matches what was actually produced. Cases 01, 04, and 07 would have surfaced within one shadow cycle.
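In outline, shadow re-execution is a comparison of two fingerprints. The sketch below assumes the job can be called as a pure function of its inputs and that its output is JSON-serializable; both assumptions are strong, and `fingerprint` and `shadow_check` are hypothetical names, not part of any tool we ship.

```python
import hashlib
import json
from typing import Any, Callable


def fingerprint(value: Any) -> str:
    """Stable SHA-256 over a canonical JSON rendering (keys sorted)."""
    canonical = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shadow_check(
    job: Callable[[Any], Any], frozen_inputs: Any, recorded_output: Any
) -> bool:
    """Re-run `job` on frozen inputs and compare against what production recorded."""
    return fingerprint(job(frozen_inputs)) == fingerprint(recorded_output)
```

A False here means one of two things: the job's behavior has changed since the recorded run, or the recorded run did not see the inputs we thought it saw. Either is worth a look.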
How do you justify the engineering cost of writing all these verifiers?
Honestly: we don't, in advance. We pay the cost after the first quiet failure costs more than the verifier would have. We argue that you should pre-pay only for the top three jobs by blast radius — payments, billing, and anything that emails customers. The other 40 jobs can wait for their first incident, because their first incident will be small.
Is there a tool for this?
We are aware of a small open-source project called verifier-rs (one of our SRE guild members maintains it) and of two commercial efforts that we are not yet ready to name. None of them solve the problem. They are scaffolding for writing verifiers, which is a useful thing, but the hard part is articulating the invariant. That is a thinking problem, not a tooling problem.
What is the single most effective change you've made?
Adding a column to our internal job registry called effect_observed_by. It is a free-text field. If the field is empty, the job is not allowed to be marked critical. If the field names a verifier, that verifier is also in the registry and must itself be observed by something. The chain ends at "a human looks at this number in the Monday review." Two-thirds of our jobs now have a non-empty effect_observed_by; the remaining third are explicitly classified as low-stakes.
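The rule can be checked mechanically. What follows is a sketch of the idea and not our registry's code: the `Job` class, the function names, and the convention that a human observer is written with a "human:" prefix are all assumptions made for the example (the real field is free text).

```python
from dataclasses import dataclass
from typing import Dict, List

HUMAN_PREFIX = "human:"  # hypothetical convention for marking the end of a chain


@dataclass
class Job:
    name: str
    effect_observed_by: str = ""


def observation_chain(registry: Dict[str, Job], name: str) -> List[str]:
    """Follow effect_observed_by from `name` until a human; raise ValueError on a gap."""
    chain = [name]
    seen = {name}
    current = registry[name]
    while True:
        observer = current.effect_observed_by.strip()
        if not observer:
            raise ValueError(f"{current.name} is observed by nothing")
        chain.append(observer)
        if observer.startswith(HUMAN_PREFIX):
            return chain
        if observer in seen:
            raise ValueError(f"observation cycle at {observer}")
        if observer not in registry:
            raise ValueError(f"verifier {observer} is not in the registry")
        seen.add(observer)
        current = registry[observer]


def may_be_critical(registry: Dict[str, Job], name: str) -> bool:
    """A job may be marked critical only if its chain ends at a human."""
    try:
        observation_chain(registry, name)
    except ValueError:
        return False
    return True
```

The cycle check matters more than it looks: two verifiers that observe each other satisfy "non-empty" while still ending nowhere.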