How Dirty Data Cost a Retailer R3 Million in Phantom Revenue
Case Study · Data Quality · Business Analytics
FY 2024 · SA Retail SME · Python · pandas · Jupyter
The Problem
A South African retail SME approached Nova Data Analytics after noticing inconsistencies in its FY 2024 revenue figures. The internal report showed total revenue of R7,129,011, but the numbers did not add up: seasonal trends looked wrong, and category-level figures were implausible.
When we audited the full 1,050-record transaction dataset, we found that the reported figure was overstated by R2,985,788. Actual revenue for the period was R4,143,223, about 41.9% lower than reported.
This wasn't fraud. It was entirely the result of accumulated data quality failures, the kind that silently compound over months when no one is systematically checking the data.
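The headline figures above can be checked with a few lines of arithmetic. The sketch below uses only the three numbers quoted in this case study; the variable names are illustrative and not taken from the project's codebase. It expresses the gap both relative to the reported figure and relative to the actual figure, since the two percentages are easy to confuse. Relative to the reported figure, the gap comes out near 41.9% when rounded to one decimal place.

```python
# Figures quoted in the case study, in South African rand.
reported_revenue = 7_129_011
overstatement = 2_985_788

# Actual revenue is what remains once the overstatement is removed.
actual_revenue = reported_revenue - overstatement

# The same gap, measured against two different baselines.
pct_below_reported = overstatement / reported_revenue * 100
pct_above_actual = overstatement / actual_revenue * 100

print(f"Actual revenue: R{actual_revenue:,}")
print(f"Actual is {pct_below_reported:.1f}% lower than reported")
print(f"Reported is {pct_above_actual:.1f}% higher than actual")
```

The choice of baseline matters when communicating the result: the same R2,985,788 gap is a much larger percentage of the actual revenue than of the reported revenue.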
What We Found
The audit identified five distinct categories of data quality issues across the 1,050 transaction records.
Our Approach
We applied a structured seven-step cleaning process that is fully documented and reproducible via the GitHub repository. Every step was logged, with record counts captured before and after it ran.
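One way to log record counts before and after each step is a small pipeline runner in pandas. This is a minimal sketch and not the project's actual code: the function `run_pipeline`, the two cleaning steps, the column names, and the sample rows are all hypothetical, chosen only to illustrate the logging pattern.

```python
import pandas as pd


def run_pipeline(df, steps):
    """Apply each cleaning step in order, logging record counts before and after."""
    log = []
    for name, step in steps:
        before = len(df)
        df = step(df)
        log.append({"step": name, "records_before": before, "records_after": len(df)})
    return df, pd.DataFrame(log)


# Hypothetical step: tidy inconsistent category labels.
def normalise_categories(df):
    out = df.copy()
    out["category"] = out["category"].str.strip().str.title()
    return out


# Hypothetical step: turn amounts stored as text (e.g. "R1,045.50") into numbers.
def parse_amounts(df):
    out = df.copy()
    cleaned = (
        out["amount"]
        .astype(str)
        .str.replace("R", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    out["amount"] = pd.to_numeric(cleaned)
    return out


# Made-up sample rows, not data from the audit.
transactions = pd.DataFrame({
    "order_id": [1, 2, 3],
    "category": [" toys", "Garden ", "TOYS"],
    "amount": ["R120.00", "80", "R1,045.50"],
})

steps = [
    ("normalise category labels", normalise_categories),
    ("parse amount strings", parse_amounts),
]

cleaned, audit_log = run_pipeline(transactions, steps)
print(audit_log)
```

Both illustrative steps repair values in place rather than dropping rows, so the logged counts stay equal. If a step did remove records, the mismatch between `records_before` and `records_after` would show exactly where it happened.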
The Code
All analysis was conducted in Python using pandas and Jupyter Notebook. The full codebase is available on GitHub under an open-source licence.
Key Takeaways
- Data quality failures are almost always cumulative: no single issue caused the R3M overstatement, but five compounding issues did.
- Without a systematic audit, this error would have influenced budgeting, forecasting, and investor reporting for the entire financial year.
- The cleaning process took less time than the business had spent second-guessing its own numbers.
- All 1,050 records were recoverable; none had to be discarded.