Phantom Revenue RS | Nova Data Analytics

How Dirty Data Cost a Retailer R3 Million in Phantom Revenue

Case Study · Data Quality · Business Analytics

FY 2024 · SA Retail SME · Python · pandas · Jupyter

The Problem

A South African retail SME approached Nova Data Analytics after noticing inconsistencies in their FY 2024 revenue figures. Their internal report showed total revenue of R7,129,011, but the numbers didn't add up: seasonal trends looked wrong and category-level figures were implausible.

When we conducted a full data audit across their 1,050-record transaction dataset, we discovered that the reported figure was overstated by R2,985,788. The actual revenue for the period was R4,143,223, about 41.9% lower than reported.
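The headline figures can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check on the numbers quoted above, not part of the audit code; the variable names are illustrative. It confirms that the overstatement comes to just under 42% of the reported total.

```python
# Figures quoted in the case study, in rand.
reported = 7_129_011       # total revenue in the internal report
overstatement = 2_985_788  # amount identified by the audit

# Actual revenue is what remains once the overstatement is removed.
actual = reported - overstatement

# Overstatement expressed as a percentage of the reported figure.
share_of_reported = overstatement / reported * 100

print(f"Actual revenue: R{actual:,}")
print(f"Overstatement as % of reported: {share_of_reported:.2f}%")
```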

This wasn't fraud. It was entirely the result of accumulated data quality failures, the kind that silently compound over months when no one is systematically checking the data.

What We Found

The audit identified five distinct categories of data quality issues across the 1,050 transaction records:

Our Approach

We applied a seven-step structured cleaning process, fully documented and reproducible via the GitHub repository. Every step was logged, with record counts captured before and after it ran.
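One way to get before-and-after record counts for every step is to run the cleaning functions through a small wrapper. The sketch below is a minimal illustration of that idea, not the code from the repository: the helper `run_steps`, the two example steps, the column names `category` and `amount`, and the sample rows are all hypothetical.

```python
import pandas as pd


def run_steps(df, steps):
    """Apply each (name, function) step in order and log record counts."""
    log = []
    for name, step in steps:
        rows_before = len(df)
        df = step(df)
        log.append(
            {"step": name, "rows_before": rows_before, "rows_after": len(df)}
        )
    return df, pd.DataFrame(log)


# Hypothetical step: trim stray whitespace from category labels.
def strip_whitespace(df):
    out = df.copy()
    out["category"] = out["category"].str.strip()
    return out


# Hypothetical step: convert amounts to numbers; unparseable values become NaN.
def coerce_amount(df):
    out = df.copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out


# Made-up sample rows, for illustration only.
raw = pd.DataFrame(
    {
        "category": [" Toys", "Toys ", "Books"],
        "amount": ["100.50", "abc", "20"],
    }
)

cleaned, audit_log = run_steps(
    raw,
    [("strip_whitespace", strip_whitespace), ("coerce_amount", coerce_amount)],
)
print(audit_log)
```

Because both example steps repair values rather than drop rows, the logged counts stay the same from step to step. A step that did remove records would show up immediately as a gap between `rows_before` and `rows_after`.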

The Code

All analysis was conducted in Python using pandas and Jupyter Notebook. The full codebase is available on GitHub under an open-source licence.

Key Takeaways

Data quality failures are almost always cumulative: no single issue caused the R3M overstatement, but five compounding issues did.
Without a systematic audit, this error would have influenced budgeting, forecasting, and investor reporting for the entire financial year.
The cleaning process took less time than the business had spent second-guessing their own numbers.
All 1,050 records were recoverable; none had to be discarded.