Tracking Down the Source

How 707 Emails Revealed a Six-Year Data Quality Issue

Character Key

Executive Summary

Discovery
Data Pull Request Initiated
Mar 2020
Sage the Heron's team takes over data operations from a former analyst who left. Goose requests the team submit a formal data pull request for FY2013-2020 fleet logistics data. The previous analyst's code and methodology are poorly documented.
Initial Comparison Reveals Differences
May 2020
Sage the Heron uploads new pull results. The team notices current data doesn't match previously reported figures. Goose flags:
"The discrepancy is between the older data pulls and the updated pulls."
New Data Arrives, Numbers Don't Match
Jul 2020
Pip the Otter begins processing new fiscal year data. Results look fundamentally different from prior years. Sage asks:
"How large are the differences compared to prior years?"
Escalation
Data Errors Formally Identified
Aug 2, 2020
Maple the Deer and Pip the Otter document specific structural errors in how record identifiers were combined. Harold the Badger is notified.
Working Group Meeting — Deep Dive Assigned
Aug 9, 2020
Internal working group meeting. The term "data errors" is formally adopted. Goose assigns Pip and Maple to:
"Do a deep dive into the data errors."
Plan due by August 11.
Felix's Breakthrough
Sep 21, 2020
Felix the Fox's analysis reveals the core issue: the previous data operations team only pulled "long-haul" shipment records from FY2013-FY2018. FY2019 was the first year that included all four record types (maintenance, fuel, long-haul, local delivery). Six fiscal years of data were structurally different.
Urgent Meeting Called
Sep 22, 2020
Harold the Badger calls all-hands:
"Can we connect to talk about the data pull?"
The scope — all reports using pre-FY2019 data are affected — becomes clear.
Investigation
Alice Investigates record types
Nov 2020
Alice the Owl requests breakdowns by record type and year. The team discovers FY2013-2018 data contains only long-haul shipment records while FY2019 contains all types.
Structural Root Cause Confirmed
Mar 2021
Archie the Hawk confirms: the data warehouse's default join from shipment records Header to Dispatch Detail sub-tables was an INNER JOIN, dropping results from ~762K to ~15K records, returning only long-haul shipment records. This silent filter caused six years of structural difference.
Formal Root Cause Analysis
Apr 24-25, 2021
Team conducts formal RCA using "5 Whys" methodology.
"There was an ordering of join conditionals that implicitly limited the data."
Contributing factor: the previous data handler worked in isolation with no parallel verification.
Action Items Documented
Jun 7, 2021
Harold the Badger documents follow-up: compare current raw data with historical data, examine former analyst's queries, create reference documents.
Resolution
Mystery Solved
Jul 19, 2021
Archie the Hawk confirms definitively:
"The default join IS an inner join. Even without a restrictive qualification for record type, linking to the Header Subtables defaults to ONLY returning long-haul shipment records."
Mitigation Plan: Blind Parallel Processing
Jul 24, 2021
Harold the Badger proposes blind parallel processing — where data handlers independently conduct pulls and confirm numbers match — as standard practice across all contracts.
Validation and Resolution
Aug-Oct 2021
Alice the Owl conducts validation pulls. The state coordinator reviews the mitigation plan. Harold the Badger presents evidence of the fix. New protocols implemented.

Glossary

RCA
Root Cause Analysis — a structured investigation to identify why a problem occurred, going beyond surface symptoms to find the underlying cause.
5 Whys
An iterative interrogation technique: ask "why?" five times to drill from a symptom to a root cause. Used in the formal RCA session.
Blind Parallel Processing
Two analysts independently perform the same data pull without seeing each other's work, then compare results to verify correctness.
Inner Join
A database operation that returns only records matching in both tables. Here, it silently filtered out non-long-haul shipment records by requiring a match in Dispatch Detail sub-tables.
Honest Broker
A trusted intermediary who handles sensitive data without direct interest in the outcome, ensuring objectivity in data operations.
Invoice Type
The classification of a logistics record — maintenance, fuel, long-haul, or local delivery. The root issue was that only one of four types was being returned.