A Credibility Check of the Chocolate Sales Dataset
Trusting the numbers before trusting the insights.
2026

This one started simply enough. I found the Chocolate Sales dataset on Kaggle, liked the structure, and figured it would make decent material for sharpening some Tableau skills. Clean columns, a few years of data, multiple countries and salespeople. Nothing suspicious on the surface. Then, luckily, I read the comments.
One Kaggler had left a note: “Odd prices.” They were not sure whether this was a record of real transactions or a generated dataset dressed up to look like one. That was enough. The Tableau dashboard could wait; I wanted to know what I was actually looking at before building anything on top of it.
The short answer: it is synthetic. Almost certainly. The long answer is more interesting.
TRANSACTION REGULARITY
Over three years, every single month contained the exact same number of transactions. Not approximately the same. Identical. Real sales data does not behave this way. Demand fluctuates. People get sick. Deals fall through. Calendars are uneven. This kind of uniformity does not emerge from the world. It gets programmed in.
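This kind of uniformity is easy to test for. Below is a minimal sketch of the check, using only the standard library; the real dataset's column names may differ, so the dates here are a toy stand-in with deliberately identical monthly counts.

```python
# Sketch: flag suspicious uniformity in monthly transaction counts.
# The toy dates below are illustrative, not rows from the real dataset.
from collections import Counter
from datetime import date

def monthly_counts(dates):
    """Count transactions per (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)

def is_suspiciously_uniform(dates):
    """True if every month has the exact same transaction count."""
    counts = monthly_counts(dates)
    return len(set(counts.values())) == 1

# Toy illustration: 28 transactions in each of three months.
toy = [date(2022, m, day) for m in (1, 2, 3) for day in range(1, 29)]
print(is_suspiciously_uniform(toy))  # identical counts -> True
```

Real sales data run through this check should come back False almost immediately; this dataset does not.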
MISSING Q4
The fourth quarter was absent entirely. September, October, November, December: all missing. For most industries this would be odd. For a chocolate company, it would be catastrophic. Halloween and Christmas alone account for a significant share of annual confectionery revenue, and the run-up starts in September. A dataset covering three years of chocolate sales with no Q4 and no September is not just incomplete. It is implausible.
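Spotting the gap takes one set difference. A sketch, again with toy dates mimicking the dataset's January-through-August coverage:

```python
# Sketch: list calendar months that never appear in the data.
# Assumes transaction dates are available as datetime.date objects.
from datetime import date

def missing_months(dates):
    present = {d.month for d in dates}
    return sorted(set(range(1, 13)) - present)

# Toy data mimicking the dataset: only January through August present.
toy = [date(2022, m, 1) for m in range(1, 9)]
print(missing_months(toy))  # [9, 10, 11, 12] -> no September, no Q4
```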
IDENTICAL SHIPMENT COUNTS
86% of products showed identical box shipment counts for the same month-day combinations across years. The generation logic appears to be: base value plus or minus noise of roughly five percent. For small base values between one and six boxes, that noise band is not wide enough to produce any variation at all. So the same number of boxes ships on the same date, year after year, product after product. Operationally, this is not how supply chains work.
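The arithmetic behind that frozen behaviour is simple: once the noise band is narrower than half a box, rounding to whole boxes erases it. A sketch of the suspected mechanism:

```python
# Sketch of why +/-5% noise cannot move small integer box counts:
# round(base * (1 + noise)) stays at base whenever base * 0.05 < 0.5.
def shipped(base, noise):
    """Boxes shipped for a given base value and noise factor in [-0.05, 0.05]."""
    return round(base * (1 + noise))

for base in range(1, 7):
    lo, hi = shipped(base, -0.05), shipped(base, 0.05)
    print(base, lo, hi)  # lo == hi == base for every base from 1 to 6
```

For base values of one to six boxes, the full extremes of the noise band round back to the base value, which is exactly the year-over-year repetition seen in the data.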
REVENUE PER BOX
This is where things get genuinely strange. The Revenue per Box metric, which ought to reflect something close to a fixed price with minor variation, was scattered beyond any reasonable range. The least volatile product in the dataset had a Revenue per Box ranging from $0.83 to $669.28. Same product. Same company. No deluxe edition, no bulk discount, no seasonal pricing strategy explains an 800x spread. Charging one customer $0.83 per box while another pays $669.28 for the same product is not a pricing model. It is noise that got out of hand.
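The spread is just the ratio of the extremes. A minimal sketch, with the reported min and max plus two made-up intermediate values for illustration:

```python
# Sketch: per-product price spread as max/min revenue per box.
# Only 0.83 and 669.28 come from the dataset; the middle values are invented.
def price_spread(revenue_per_box):
    return max(revenue_per_box) / min(revenue_per_box)

least_volatile = [0.83, 12.40, 87.15, 669.28]
print(round(price_spread(least_volatile)))  # ~806x for the least volatile product
```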
COEFFICIENT OF VARIATION
The Coefficient of Variation quantifies pricing consistency. A CV above 1.0 means the standard deviation exceeds the mean, which indicates extreme dispersion. Every single product in this dataset had a CV above 1. In a real FMCG company, you would expect values below 0.5. This is not a subtle discrepancy. It is the statistical fingerprint of a revenue column that was generated by multiplying a base value by a noise factor with a loose upper bound.
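The metric itself is one line of stdlib code. A sketch, contrasting a realistic price series with a dataset-style one (both series are invented for illustration):

```python
# Sketch: coefficient of variation, population stddev divided by mean.
# A real FMCG price series should land well below 0.5; this data does not.
from statistics import mean, pstdev

def cv(values):
    return pstdev(values) / mean(values)

stable  = [10.0, 10.2, 9.8, 10.1, 9.9]   # realistic pricing, CV near zero
chaotic = [0.83, 5.0, 120.0, 669.28]     # dataset-style noise, CV above 1
print(cv(stable) < 0.5, cv(chaotic) > 1.0)
```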
NO TERRITORIAL LOGIC
Every salesperson sells in every country the company operates in. There are no territories, no regional responsibilities, no market specialisations. In real sales organisations, coverage is divided. Salespeople have accounts. Countries have teams. Here, every combination of salesperson and country exists in the data with exactly equal representation.
NO PARETO EFFECT
In real product portfolios, roughly 20% of products drive 80% of revenue. There are always cash cows and there are always underperformers that get carried. In this dataset, revenue is distributed almost uniformly across products. Everything performs about as well as everything else. That is not a healthy portfolio. That is a random number generator with a fixed seed.
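The 80/20 check reduces to sorting and summing. A sketch with two invented portfolios, one skewed the way real revenue tends to be and one flat the way this dataset is:

```python
# Sketch: share of total revenue carried by the top 20% of products.
# A real portfolio skews toward ~0.8; a uniform synthetic one sits near 0.2.
def top20_share(revenues):
    ordered = sorted(revenues, reverse=True)
    k = max(1, round(len(ordered) * 0.2))
    return sum(ordered[:k]) / sum(ordered)

pareto  = [800, 50, 50, 50, 50]      # one cash cow carrying the portfolio
uniform = [200, 201, 199, 200, 200]  # dataset-style flatness
print(round(top20_share(pareto), 2), round(top20_share(uniform), 2))
```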
GENERATION_LOGIC (reverse-engineered)
date_template = fixed calendar values per product × country × salesperson combination
boxes_shipped = base_value ± noise(~5%) // too small for variation at low values
revenue = base_value × (1 + inflation)^year ± noise // ~3-5% YoY uplift
The picture is fairly clear. The dataset was almost certainly built from fixed date templates applied across every product-country-salesperson combination, with noise layered on top. No interaction between dimensions. No territorial logic. No portfolio dynamics. No seasonal reality.
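The suspected generator above can be sketched as runnable Python. Everything here is an assumption reverse-engineered from the data: the 4% uplift, the ±5% noise band, and all parameter names are mine, not the dataset author's.

```python
# Sketch of the suspected generator; all parameters are assumptions.
import random

def generate_row(base_boxes, base_revenue, year_index, rng):
    # boxes: base value +/- ~5% noise, too narrow to move small integers
    boxes = round(base_boxes * (1 + rng.uniform(-0.05, 0.05)))
    # revenue: base value with an assumed ~4% yearly uplift, plus noise
    revenue = base_revenue * (1.04 ** year_index) * (1 + rng.uniform(-0.05, 0.05))
    return boxes, round(revenue, 2)

rng = random.Random(0)  # fixed seed, just like the suspected original
for year in range(3):
    print(generate_row(5, 120.0, year, rng))  # boxes never move off 5
```

Run for three "years", the boxes column repeats exactly while revenue drifts upward, reproducing the frozen shipment counts described above. The real revenue noise must have been far looser than ±5% to produce CVs above 1.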
None of which makes it a bad dataset. It is well-structured, consistently formatted, and genuinely useful for practising BI and visualisation skills. That was the original plan, and it still holds. It just also turned out to be a reasonably entertaining detective case along the way. The Kaggler who wrote “Odd prices” was onto something. They just did not follow the thread.
Interactive Dashboard
> Use filters and tooltips to explore the data. The dashboard may require scrolling on smaller screens.
Jupyter Notebook
> Full analysis — rendered from the original .ipynb file.