How Dirty Data Cost a Retailer R3 Million in Phantom Revenue
- Nova Data Analytics

- Mar 4
- 5 min read
1,050 transactions analysed · 7 cleaning steps applied · FY 2024 · South African Retail
A retail business closes its books for the year. The revenue report reads R7,129,011. The finance team is satisfied. Leadership is pleased. But the dataset has never been cleaned — and the real number is R4,143,223. That is an R2,985,788 overstatement. Not fraud. Not a system failure. Just four categories of data quality issues that almost every business has, and almost none actively address.
The problem
Most businesses are reporting the wrong numbers
Data quality is rarely treated as urgent until something goes wrong — a failed audit, a commission dispute, a board presentation that doesn't add up. By that point, the damage is already done. The challenge is that dirty data doesn't announce itself. It hides inside spreadsheets and databases, producing totals that look plausible right up until someone looks closely.
To demonstrate exactly how this happens, we constructed a realistic retail transaction dataset of 1,050 records across 10 product categories and 5 regions. We injected the four most common data quality issues found in real South African retail and e-commerce environments, then walked through what a business would report — and what the truth is.
Issue Audit
Four categories of problems. One very wrong number.
Before touching a single row of data, we audited the full 1,050-row dataset systematically. The goal was not just to find problems, but to quantify them — how many rows are affected, which fields are impacted, and what the financial consequence of each issue is.
🔴 Critical
Inflated Amounts
Data entry errors producing values 10x or 100x the correct amount. A R4,499 Smart Watch appearing as R449,900 in a single transaction.
7 rows affected · Largest single overstatement
🔴 Critical
Duplicate Transactions
50 rows duplicated — same Transaction ID appearing twice. Every duplicated row counts its revenue twice in any naive aggregation.
50 rows affected · Direct double-counting
🟠 High
Inconsistent Formatting
Region names in 22 variants. Dates in 5 different formats. Prices stored as strings with R symbols. Discount rates entered as percentages instead of decimals.
180+ rows affected · Breaks all aggregation
🟡 Medium
Missing Values
80 blank cells across Customer ID, Region, Discount Rate and Payment Method. Segmentation analysis is impossible when core fields are absent.
80 cells across 4 columns · Breaks reporting
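An audit of this kind can be sketched in a few lines of pandas. This is an illustrative sketch, not the notebook's actual code, and the column names (`TransactionID`, `Region`, `UnitPrice`) are assumptions about the schema:

```python
import pandas as pd

def audit(df: pd.DataFrame) -> dict:
    """Quantify the four issue categories before cleaning anything."""
    return {
        # same Transaction ID appearing more than once
        "duplicate_rows": int(df.duplicated(subset="TransactionID").sum()),
        # how many distinct spellings the region field currently has
        "region_variants": int(df["Region"].nunique(dropna=True)),
        # blank cells per column, summed across the table
        "missing_cells": int(df.isna().sum().sum()),
        # prices stored as text rather than numbers
        "text_prices": int(df["UnitPrice"].map(lambda v: isinstance(v, str)).sum()),
    }
```

Running the audit first, before any cleaning, is what makes the financial impact of each issue quantifiable.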

The Cleaning Process
Seven steps. Executed in order. Fully audited.
Data cleaning must follow a deliberate sequence — structural issues first, value issues second, outliers last. Cleaning out of order produces misleading counts and can cause earlier steps to undo the work of later ones. Every step below was logged so the methodology is fully auditable.
Fixed
Remove Duplicate Transactions
50 duplicate rows identified by Transaction ID and removed, retaining the first occurrence. Note: 2 of the 60 date issues and 6 of the 80 missing values were removed along with these rows, a reconciliation tracked explicitly in the audit log.
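In pandas this step might look like the following sketch. `TransactionID` is an assumed column name, and `keep="first"` implements the retain-the-first-occurrence rule described above:

```python
import pandas as pd

def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose Transaction ID has already been seen, keeping the first."""
    before = len(df)
    deduped = df.drop_duplicates(subset="TransactionID", keep="first")
    # Log the removal count so the step remains auditable.
    print(f"Removed {before - len(deduped)} duplicate rows")
    return deduped
```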
Flagged & Filled
Handle Missing Values
74 remaining missing cells addressed with contextual defaults: unknown customers flagged as UNKNOWN, blank regions as UNASSIGNED, missing discount rates assumed to be 0%, unknown payment methods flagged for finance review.
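A minimal sketch of the contextual-default approach, assuming these column names; the point is that each field gets a default that preserves meaning, rather than a blanket `fillna(0)`:

```python
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply contextual defaults rather than one blanket fill value."""
    out = df.copy()
    out["CustomerID"] = out["CustomerID"].fillna("UNKNOWN")
    out["Region"] = out["Region"].fillna("UNASSIGNED")
    out["DiscountRate"] = out["DiscountRate"].fillna(0.0)  # assume no discount
    # Flag rather than guess: finance reviews these rows manually.
    out["PaymentMethod"] = out["PaymentMethod"].fillna("REVIEW")
    return out
```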
Fixed
Standardise Region Names
73 rows with variant region names (north, NORTH, Nrth, N.) mapped to five canonical values. Without this step, regional revenue totals are split across multiple phantom categories.
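The mapping itself is a plain dictionary from every observed variant to its canonical name. The entries below are illustrative; the real notebook's map covers all 22 spellings:

```python
import pandas as pd

# Illustrative variant map: each observed spelling -> canonical region name.
REGION_MAP = {
    "north": "North", "NORTH": "North", "Nrth": "North", "N.": "North",
    "south": "South", "SOUTH": "South",
}

def standardise_regions(s: pd.Series) -> pd.Series:
    """Map known variants to canonical names; values already canonical pass through."""
    return s.replace(REGION_MAP)
```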
Fixed
Standardise Date Formats
58 dates in non-standard formats (DD/MM/YYYY, MM-DD-YYYY, DD-Mon-YYYY) parsed and converted to ISO 8601 (YYYY-MM-DD). Monthly trend analysis is impossible without consistent date formatting.
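One way to normalise mixed formats is to try each known format in turn, failing loudly on anything unrecognised. The format list below covers the three variants named above plus ISO itself; note that genuinely ambiguous dates (e.g. 03/04/2024) need a business rule about which format wins:

```python
from datetime import datetime

# Known input formats, tried in order of precedence.
DATE_FORMATS = ["%d/%m/%Y", "%m-%d-%Y", "%d-%b-%Y", "%Y-%m-%d"]

def to_iso(value: str) -> str:
    """Parse one date string against each known format; return ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")
```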
Fixed
Clean Unit Price Field
11 prices stored as text strings with R symbols (e.g. "R1299.00") stripped and cast to numeric. Text values in numeric fields cause silent calculation failures in Excel and BI tools.
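A sketch of the cast, assuming a `UnitPrice`-style series. Stripping the symbol and any thousands separators before the numeric cast is what prevents the silent failures:

```python
import pandas as pd

def clean_prices(s: pd.Series) -> pd.Series:
    """Strip currency symbols and separators from text prices, then cast to float."""
    return (
        s.astype(str)
         .str.replace("R", "", regex=False)
         .str.replace(",", "", regex=False)  # thousands separators, if present
         .str.strip()
         .astype(float)
    )
```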
Fixed
Fix Discount Rate Formatting
9 rows where discount rates were entered as whole percentages (e.g. 10 instead of 0.10) corrected by dividing by 100. A 10% discount entered as 10 applies a 1,000% discount in calculations — eliminating all revenue from those rows.
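The correction reduces to a single conditional rescale. It rests on one assumption, stated in the code: no legitimate discount exceeds 100%, so any rate above 1 must be a percentage entered without conversion:

```python
import pandas as pd

def fix_discounts(s: pd.Series) -> pd.Series:
    """Rescale whole-percentage entries (e.g. 10) to decimals (0.10).

    Assumes no legitimate discount exceeds 100%, so any value > 1
    is treated as an unconverted percentage.
    """
    return s.where(s <= 1, s / 100)
```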
Fixed
Correct Revenue Outliers
Revenue validated against a maximum possible legitimate transaction (5 units × R4,499 Smart Watch × 1.05 buffer = R23,619.75). 7 inflated rows, 15 negative revenues, and 10 zero-revenue rows corrected by recalculating from unit price, quantity, and discount.
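The key move here is recomputing revenue from first principles rather than patching the stored value. A sketch, with assumed column names:

```python
import pandas as pd

# Maximum plausible transaction: 5 units x R4,499 x 5% buffer = R23,619.75
MAX_LEGIT_REVENUE = 5 * 4499 * 1.05

def fix_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Recompute revenue from components wherever the stored value is implausible."""
    out = df.copy()
    expected = out["UnitPrice"] * out["Quantity"] * (1 - out["DiscountRate"])
    # Implausible = above the legitimate ceiling, negative, or zero.
    bad = (out["Revenue"] > MAX_LEGIT_REVENUE) | (out["Revenue"] <= 0)
    out.loc[bad, "Revenue"] = expected[bad]
    return out
```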
Before vs After
The numbers that actually matter

| Metric | Dirty Data | Clean Data |
| --- | --- | --- |
| Total Transaction Rows | 1,050 | 1,000 |
| Reported Revenue | R7,129,011 | R4,143,223 |
| Revenue Overstatement | R2,985,788 | — |
| Error Rate | 72.1% | 0% |
| Unresolvable Records | Unknown | 0 |
"The R2,985,788 discrepancy would lead to incorrect commission calculations, misinformed inventory decisions, and inaccurate tax reporting to SARS. Data cleaning is not a technical exercise — it is a business-critical process."
— From the full analysis notebook
Business Implications
This is not just a revenue number problem
The headline overstatement is striking. But the downstream consequences of operating on dirty data extend well beyond the revenue line.
Commission and incentive calculations based on inflated revenue will overpay or underpay sales teams. Disputes arise retrospectively and are expensive to resolve.
Inventory and procurement decisions driven by inflated demand signals lead to overstock, tied-up capital, and markdown losses when actual demand is lower than reported.
Regional performance reporting is meaningless when region names are split across 22 variants. North, NORTH, Nrth, and N. are four separate categories in any aggregation — none of them accurate.
Tax reporting to SARS on incorrect revenue figures creates compliance risk. Overstated revenue means potential overpayment; understated figures carry audit risk.
Executive dashboards and board reporting built on unclean data create false confidence. Decisions made on phantom numbers are systematically disconnected from business reality.
The takeaway
The fix is not complicated — but it must be systematic
The seven cleaning steps in this analysis took under two minutes to execute on 1,000 rows. The methodology is documented, repeatable, and auditable. The real challenge is not the cleaning itself — it is building these checks into a pipeline so that every data load is validated automatically before it ever reaches a report.
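Automating those checks can be as simple as a validation gate that every data load must pass before it reaches a report. A minimal sketch, with the same assumed column names used throughout:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Fail fast at load time rather than discover problems in a board pack."""
    assert not df.duplicated(subset="TransactionID").any(), "duplicate transactions"
    assert pd.api.types.is_numeric_dtype(df["UnitPrice"]), "text values in UnitPrice"
    assert df["DiscountRate"].between(0, 1).all(), "discount rates outside 0-1"
    assert (df["Revenue"] > 0).all(), "non-positive revenue"
```

In production these would be proper exceptions (or a framework such as a data-validation library) rather than bare asserts, but the principle is the same: no dataset reaches a dashboard without passing the gate.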
Most businesses are one spreadsheet export away from the scenario described in this article. The question is whether you know it before or after the numbers have been relied upon.
Want to see the full analysis?
The complete Jupyter notebook — including all code, step-by-step cleaning logic, and four interactive charts — is available on GitHub. If you want to know whether your own data has issues like these, we are happy to talk.

