"Is This Real?" – A Data Plausibility Check on the Chocolate Sales Dataset¶

The Trigger: A comment on Kaggle questioned whether this popular chocolate sales dataset represents real-world business data. That got me curious — so I decided to stress-test the numbers using the same kind of analytical checks you'd run when validating sales figures in a financial context: pricing consistency, concentration risk, structural patterns, and data completeness.

The Verdict: The data is almost certainly synthetically generated. Here's how I got there — step by step.

Dataset: Chocolate Sales on Kaggle | 3,282 transactions | 25 salespersons | 6 countries | 22 products | Jan 2022 – Aug 2024


1. Setup & First Look¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')

df = pd.read_csv(r"C:\Users\bvenn\OneDrive\Desktop\Python Projekte\Showcases\Tableau_projekte\Chocolate_sales\dataset\Schoki_sales.csv")
df['date'] = pd.to_datetime(df['date'])
df['revenue_per_box'] = (df['sales_amount'] / df['boxes_shipped']).round(2)

print(f'Shape: {df.shape}')
print(f'Date Range: {df["date"].min().date()} to {df["date"].max().date()}')
print(f'Salespersons: {df["sales_person"].nunique()} | Countries: {df["country"].nunique()} | Products: {df["product"].nunique()}')
print(f'\nMissing Values: {df.isnull().sum().sum()} | Duplicates: {df.duplicated().sum()} | Negative Values: {(df["sales_amount"] <= 0).sum()}')
df.head()
Shape: (3282, 7)
Date Range: 2022-01-03 to 2024-08-31
Salespersons: 25 | Countries: 6 | Products: 22

Missing Values: 0 | Duplicates: 0 | Negative Values: 0
sales_person country product date sales_amount boxes_shipped revenue_per_box
0 Jehu Rudeforth UK Mint Chip Choco 2022-01-04 5320 180 29.56
1 Van Tuxwell India 85% Dark Bars 2022-08-01 7896 94 84.00
2 Gigi Bohling India Peanut Butter Cubes 2022-07-07 4501 91 49.46
3 Jan Morforth Australia Peanut Butter Cubes 2022-04-27 12726 342 37.21
4 Jehu Rudeforth UK Peanut Butter Cubes 2022-02-24 13685 184 74.38

At first glance, a clean dataset: no missing values, no duplicates, no negative amounts. Exactly the kind of thing that looks fine until you start digging into the numbers.

2. Red Flag #1 – Pricing All Over the Place¶

Revenue per Box (RPB), the average revenue per unit shipped, is our key metric here. For any FMCG business, you'd expect relatively stable pricing per product, with moderate variation from volume discounts or seasonal promotions.

# ===============================
# Descriptive Statistics
# ===============================
print('=== Revenue per Box – Overall Distribution ===')
print(f'Minimum:  ${df["revenue_per_box"].min():,.2f}')
print(f'Median:   ${df["revenue_per_box"].median():,.2f}')
print(f'Mean:     ${df["revenue_per_box"].mean():,.2f}')
print(f'Maximum:  ${df["revenue_per_box"].max():,.2f}')
print(f'\nRPB > $500:  {(df["revenue_per_box"] > 500).sum()} transactions '
      f'({(df["revenue_per_box"] > 500).mean()*100:.1f}%)')
print(f'RPB < $1:    {(df["revenue_per_box"] < 1).sum()} transactions '
      f'({(df["revenue_per_box"] < 1).mean()*100:.1f}%)')

# ===============================
# Prepare extreme values
# ===============================
top5 = df.nlargest(5, 'revenue_per_box')[[
    'product', 'country', 'boxes_shipped', 'sales_amount', 'revenue_per_box'
]]

bottom5 = df.nsmallest(5, 'revenue_per_box')[[
    'product', 'country', 'boxes_shipped', 'sales_amount', 'revenue_per_box'
]]

# ===============================
# Create custom grid layout
# ===============================
fig = plt.figure(figsize=(18, 8))
gs = fig.add_gridspec(2, 2, width_ratios=[2, 1])

# ---- (1) Boxplot spans full left column ----
ax_box = fig.add_subplot(gs[:, 0])
sns.boxplot(data=df, x='product', y='revenue_per_box', ax=ax_box)
ax_box.tick_params(axis='x', rotation=90)  # avoids the set_ticklabels FixedLocator warning
ax_box.set_title('RPB Distribution by Product')
ax_box.set_ylabel('Revenue per Box ($)')
ax_box.set_xlabel('Product')

# ---- (2) Top 5 Highest (top-right) ----
ax_top = fig.add_subplot(gs[0, 1])
ax_top.barh(range(5), top5['revenue_per_box'].values)
ax_top.set_yticks(range(5))
ax_top.set_yticklabels(
    [f"{row['product']} ({row['country']})" for _, row in top5.iterrows()]
)
ax_top.set_title('Top 5 Highest RPB')
ax_top.set_xlabel('Revenue per Box ($)')
ax_top.invert_yaxis()

# ---- (3) Top 5 Lowest (bottom-right) ----
ax_bottom = fig.add_subplot(gs[1, 1])
ax_bottom.barh(range(5), bottom5['revenue_per_box'].values)
ax_bottom.set_yticks(range(5))
ax_bottom.set_yticklabels(
    [f"{row['product']} ({row['country']})" for _, row in bottom5.iterrows()]
)
ax_bottom.set_title('Top 5 Lowest RPB')
ax_bottom.set_xlabel('Revenue per Box ($)')
ax_bottom.invert_yaxis()

plt.tight_layout()
plt.show()
=== Revenue per Box – Overall Distribution ===
Minimum:  $0.01
Median:   $38.19
Mean:     $111.33
Maximum:  $4,692.00

RPB > $500:  138 transactions (4.2%)
RPB < $1:    56 transactions (1.7%)

An RPB spread of $0.01 to $4,692 — for chocolate. The median sits at $38, but the mean is $111, indicating massive right-skew.
138 transactions exceed $500 per box, and 56 fall below $1. No realistic pricing model produces this kind of spread.

Coefficient of Variation – Pricing Consistency Check¶

The CoV (standard deviation / mean) measures how consistent a product's pricing is. For a real FMCG company, you'd expect CoV between 0.1 and 0.5 for most products.

rpb_by_product = df.groupby('product')['revenue_per_box'].agg(['mean', 'median', 'std', 'count'])
rpb_by_product['cov'] = (rpb_by_product['std'] / rpb_by_product['mean']).round(2)
rpb_by_product = rpb_by_product.sort_values('cov', ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#e74c3c' if x > 2 else '#f39c12' if x > 1.5 else '#f1c40f' for x in rpb_by_product['cov']]
bars = ax.barh(rpb_by_product.index, rpb_by_product['cov'], color=colors)
ax.axvline(x=0.5, color='green', linestyle='--', linewidth=2, label='Expected FMCG range (0.1–0.5)')
ax.axvline(x=1.0, color='orange', linestyle='--', linewidth=2, label='CoV = 1 (std > mean)')
ax.set_xlabel('Coefficient of Variation')
ax.set_title('Pricing Consistency by Product (CoV)')
ax.legend()
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print(f'Lowest CoV:  {rpb_by_product["cov"].min()} ({rpb_by_product["cov"].idxmin()})')
print(f'Highest CoV: {rpb_by_product["cov"].max()} ({rpb_by_product["cov"].idxmax()})')
print(f'\nProducts with CoV > 1: {(rpb_by_product["cov"] > 1).sum()} out of {len(rpb_by_product)}')
Lowest CoV:  1.43 (Almond Choco)
Highest CoV: 3.31 (Mint Chip Choco)

Products with CoV > 1: 22 out of 22

All 22 products have a CoV above 1.4. That means for every single product, the standard deviation exceeds the mean. Not a single product shows a stable pricing structure. In a real company, this would be an immediate red flag in any data review.

3. Red Flag #2 – Suspiciously Uniform Distribution¶

Real businesses have cash cows and underperformers. The Pareto principle (80/20 rule) tells us that roughly 20% of products typically generate 80% of revenue. Let's check.

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Product Pareto
prod_rev = df.groupby('product')['sales_amount'].sum().sort_values(ascending=False)
prod_rev_pct = (prod_rev / prod_rev.sum() * 100)
prod_rev_cum = prod_rev_pct.cumsum()

axes[0].bar(range(len(prod_rev)), prod_rev_pct.values, color='#2196F3', alpha=0.7)
ax2 = axes[0].twinx()
ax2.plot(range(len(prod_rev)), prod_rev_cum.values, 'r-o', markersize=4)
ax2.axhline(y=80, color='red', linestyle='--', alpha=0.5)
axes[0].set_xticks(range(len(prod_rev)))
axes[0].set_xticklabels(prod_rev.index, rotation=90)
axes[0].set_ylabel('Revenue Share (%)')
ax2.set_ylabel('Cumulative %')
axes[0].set_title('Product Revenue Concentration')

n_80_prod = (prod_rev_cum <= 80).sum() + 1

# Salesperson Pareto
sp_rev = df.groupby('sales_person')['sales_amount'].sum().sort_values(ascending=False)
sp_rev_pct = (sp_rev / sp_rev.sum() * 100)
sp_rev_cum = sp_rev_pct.cumsum()

axes[1].bar(range(len(sp_rev)), sp_rev_pct.values, color='#4CAF50', alpha=0.7)
ax3 = axes[1].twinx()
ax3.plot(range(len(sp_rev)), sp_rev_cum.values, 'r-o', markersize=4)
ax3.axhline(y=80, color='red', linestyle='--', alpha=0.5)
axes[1].set_xticks(range(len(sp_rev)))
axes[1].set_xticklabels(sp_rev.index, rotation=90)
axes[1].set_ylabel('Revenue Share (%)')
ax3.set_ylabel('Cumulative %')
axes[1].set_title('Salesperson Revenue Concentration')

n_80_sp = (sp_rev_cum <= 80).sum() + 1

plt.tight_layout()
plt.show()

print(f'Products needed for 80% revenue: {n_80_prod} of {len(prod_rev)} ({n_80_prod/len(prod_rev)*100:.0f}%) – expected: ~20%')
print(f'Salespersons needed for 80% revenue: {n_80_sp} of {len(sp_rev)} ({n_80_sp/len(sp_rev)*100:.0f}%) – expected: ~20%')
Products needed for 80% revenue: 17 of 22 (77%) – expected: ~20%
Salespersons needed for 80% revenue: 19 of 25 (76%) – expected: ~20%
sp_country = df.groupby('sales_person')['country'].nunique()

print(f'Number of markets: {df["country"].nunique()}')
print(f'Countries per salesperson: min={sp_country.min()}, max={sp_country.max()}')
print(f'\n-> Every single one of the 25 salespersons sells in all 6 countries.')
print('   In a real global company, you\'d expect regional coverage — not everyone selling everywhere.')
Number of markets: 6
Countries per salesperson: min=6, max=6

-> Every single one of the 25 salespersons sells in all 6 countries.
   In a real global company, you'd expect regional coverage — not everyone selling everywhere.

77% of products are needed to reach 80% of revenue; in a typical business, you'd expect closer to 20% (the Pareto principle), with clear cash cows and underperformers. The same pattern holds for salespersons: 76% of them generate 80% of revenue. On top of that, every single salesperson operates across all 6 countries with no regional focus. Any one of these findings alone could be explained away. But the fact that products, salespersons, and geography all show the same suspiciously flat distribution at the same time is what makes this a red flag — real businesses almost always have some form of concentration somewhere.
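One way to put a number on this flatness is the Gini coefficient (0 = perfectly uniform shares, values near 1 = heavy concentration). Below is a minimal sketch with made-up revenue shares; `gini` is a helper defined here for illustration, not part of the analysis above. Applied to `prod_rev` or `sp_rev`, a near-zero value would corroborate the missing Pareto effect.

```python
import numpy as np

def gini(values):
    """Gini coefficient of a 1-D array of non-negative values.
    0 = perfectly even shares; values near 1 = extreme concentration."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    # Rank-weighted cumulative-sum formulation of the Gini coefficient
    return 2 * np.sum(np.arange(1, n + 1) * v) / (n * v.sum()) - (n + 1) / n

# A Pareto-like portfolio (a few cash cows) vs. a flat synthetic one
pareto_like = np.array([50, 20, 10, 8, 5, 3, 2, 1, 0.5, 0.5])
flat = np.full(10, 10.0)

print(f'Gini (Pareto-like): {gini(pareto_like):.2f}')  # → 0.65, heavy concentration
print(f'Gini (flat):        {gini(flat):.2f}')         # → 0.00, suspiciously uniform
```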

4. Red Flag #3 – Where's Christmas?¶

Chocolate is a seasonal business. Halloween, Christmas, and Valentine's Day typically account for a significant share of annual revenue in confectionery. Let's look at the calendar.

monthly = df.groupby([df['date'].dt.year.rename('year'), df['date'].dt.month.rename('month')]).agg(
    transactions=('sales_amount', 'count'),
    revenue=('sales_amount', 'sum')
)

tx_pivot = monthly['transactions'].unstack(level=0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

tx_pivot.plot(kind='bar', ax=axes[0], colormap='Set2')
axes[0].set_title('Transactions per Month')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Number of Transactions')
axes[0].legend(title='Year')

months_present = df.groupby([df['date'].dt.year.rename('year'), df['date'].dt.month.rename('month')]).size().reset_index(name='count')
heatmap_data = months_present.pivot(index='year', columns='month', values='count').reindex(columns=range(1,13))
sns.heatmap(heatmap_data, annot=True, fmt='.0f', cmap='YlOrRd', ax=axes[1], 
            xticklabels=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
            linewidths=0.5, cbar_kws={'label': 'Transactions'})
axes[1].set_title('Transaction Heatmap by Month & Year')
axes[1].set_ylabel('Year')

plt.tight_layout()
plt.show()

print('=== Transactions per Month (Year-over-Year) ===')
print(tx_pivot.to_string())
print(f'\n-> September through December: completely empty. No Q4 business.')
print(f'-> Transaction counts per month are IDENTICAL across all 3 years.')
=== Transactions per Month (Year-over-Year) ===
year   2022  2023  2024
month                  
1       154   154   154
2       110   110   110
3       131   131   131
4       118   118   118
5       135   135   135
6       163   163   163
7       149   149   149
8       134   134   134

-> September through December: completely empty. No Q4 business.
-> Transaction counts per month are IDENTICAL across all 3 years.

Two major issues here:

  1. Q4 is completely missing: September through December, the peak revenue season for any confectionery business (Halloween, Christmas), simply doesn't exist in the data
  2. Transaction counts per month are identical across all three years: January always has exactly 154 transactions, February always 110, March always 131, and so on. In reality, these numbers would naturally fluctuate.
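To gauge just how improbable identical counts are, assume (purely for illustration — the original analysis makes no distributional claim) that each month's transaction count is an independent Poisson draw with a mean equal to the observed count. The chance of a count repeating exactly in two further years is then:

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    # P(X = k) for a Poisson(lam) variable, computed in log space for stability
    return exp(k * log(lam) - lam - lgamma(k + 1))

# January's count, identical in 2022, 2023, and 2024
jan = 154
# Even at the distribution's mode, a single exact hit is unlikely;
# two further years both landing on exactly 154 compounds that:
p_match_once = poisson_pmf(jan, jan)
p_three_identical = p_match_once ** 2

# And all eight observed months repeating across all three years:
counts = [154, 110, 131, 118, 135, 163, 149, 134]
p_all = 1.0
for c in counts:
    p_all *= poisson_pmf(c, c) ** 2

print(f'P(one year hits exactly {jan} again):      {p_match_once:.4f}')
print(f'P(three identical Januaries):             {p_three_identical:.2e}')
print(f'P(all 8 months identical, all 3 years):   {p_all:.2e}')
```

Under this assumption the all-months probability is astronomically small, which is what "statistically impossible with real data" means in practice.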

5. The Smoking Gun – Deterministic Patterns¶

This is where it gets forensic. We compare individual product-country combinations across all three years and look for patterns that would be impossible in real-world data.

samples = [
    ('White Choc', 'Canada'),
    ('Eclairs', 'Australia'),
    ('Drinking Coco', 'USA'),
    ('Mint Chip Choco', 'New Zealand'),
]

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, (product, country) in enumerate(samples):
    subset = df[(df['product'] == product) & (df['country'] == country)].sort_values('date')
    subset['month_day'] = subset['date'].dt.strftime('%m-%d')
    
    pivot_boxes = subset.pivot_table(index='month_day', columns=subset['date'].dt.year, values='boxes_shipped', aggfunc='first')
    
    ax = axes[idx]
    pivot_boxes.plot(kind='bar', ax=ax, colormap='Set2', width=0.8)
    ax.set_title(f'{product} | {country}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Date (MM-DD)')
    ax.set_ylabel('Boxes Shipped')
    ax.legend(title='Year')
    ax.tick_params(axis='x', rotation=45)
    
plt.suptitle('Boxes Shipped: Same dates, same patterns — three years running', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Systematic analysis: How many exact matches exist across all product-country combos?
all_results = []

for product in df['product'].unique():
    for country in df['country'].unique():
        subset = df[(df['product'] == product) & (df['country'] == country)].sort_values('date')
        if len(subset) == 0:
            continue
        
        subset['month_day'] = subset['date'].dt.strftime('%m-%d')
        pivot = subset.pivot_table(index='month_day', columns=subset['date'].dt.year, values='boxes_shipped', aggfunc='first')
        
        years = sorted(pivot.columns)
        if len(years) < 3:
            continue
            
        complete = pivot.dropna()
        if len(complete) == 0:
            continue
            
        exact_3y = (complete[years[0]] == complete[years[1]]) & (complete[years[1]] == complete[years[2]])
        
        pivot_rpb = subset.pivot_table(index='month_day', columns=subset['date'].dt.year, values='revenue_per_box', aggfunc='first')
        complete_rpb = pivot_rpb.dropna()
        if len(complete_rpb) > 0:
            monoton_rising = ((complete_rpb[years[2]] > complete_rpb[years[1]]) & (complete_rpb[years[1]] > complete_rpb[years[0]])).mean()
        else:
            monoton_rising = np.nan
        
        all_results.append({
            'product': product,
            'country': country,
            'n_dates': len(complete),
            'exact_matches_3y': exact_3y.sum(),
            'exact_match_pct': (exact_3y.mean() * 100).round(1),
            'rpb_monoton_rising_pct': (monoton_rising * 100).round(1) if not np.isnan(monoton_rising) else np.nan
        })

results_df = pd.DataFrame(all_results)

print('=== Deterministic Pattern Analysis ===')
print(f'Product-country combinations analyzed: {len(results_df)}')
print(f'Combinations with exact box matches (all 3 years): {(results_df["exact_matches_3y"] > 0).sum()} of {len(results_df)}')
print(f'Average exact match rate: {results_df["exact_match_pct"].mean():.1f}%')
print(f'\nRPB monotonically rising (2022 < 2023 < 2024):')
print(f'Average: {results_df["rpb_monoton_rising_pct"].mean():.1f}% of data points per combination')

print(f'\n=== Top 10 Combinations by Exact Match Rate ===')
print(results_df.nlargest(10, 'exact_match_pct')[['product', 'country', 'n_dates', 'exact_matches_3y', 'exact_match_pct']].to_string(index=False))
=== Deterministic Pattern Analysis ===
Product-country combinations analyzed: 132
Combinations with exact box matches (all 3 years): 37 of 132
Average exact match rate: 3.9%

RPB monotonically rising (2022 < 2023 < 2024):
Average: 48.0% of data points per combination

=== Top 10 Combinations by Exact Match Rate ===
             product     country  n_dates  exact_matches_3y  exact_match_pct
          White Choc      Canada        9                 4            44.40
Choco Coated Almonds         USA        4                 1            25.00
             Eclairs   Australia       10                 2            20.00
Choco Coated Almonds   Australia        5                 1            20.00
 Baker's Choco Chips      Canada        5                 1            20.00
     Mint Chip Choco          UK        6                 1            16.70
     99% Dark & Pure      Canada        6                 1            16.70
        Orange Choco          UK        6                 1            16.70
      70% Dark Bites New Zealand        6                 1            16.70
           Milk Bars      Canada        7                 1            14.30
# Deep dive: White Choc / Canada — the most striking example
wc_ca = df[(df['product'] == 'White Choc') & (df['country'] == 'Canada')].sort_values('date')
wc_ca['month_day'] = wc_ca['date'].dt.strftime('%m-%d')

pivot_detail = wc_ca.pivot_table(
    index='month_day', 
    columns=wc_ca['date'].dt.year, 
    values=['boxes_shipped', 'revenue_per_box'], 
    aggfunc='first'
)

print('=== White Choc | Canada — Year-over-Year Comparison ===')
print(pivot_detail.to_string())
print('\n-> Boxes shipped on Mar 29: exactly 1, 1, 1 — three years in a row')
print('-> Boxes shipped on Mar 22: exactly 3, 3, 3 — three years in a row')
print('-> Boxes shipped on Jul 11: exactly 4, 4, 4 — three years in a row')
print('-> RPB increases slightly each year: A built-in inflation factor (~3-5%)')
=== White Choc | Canada — Year-over-Year Comparison ===
          boxes_shipped           revenue_per_box                  
date               2022 2023 2024            2022     2023     2024
month_day                                                          
01-25               136  132  142           34.02    35.75    36.83
02-14                29   31   29          140.24   143.06   150.62
03-03                72   74   71           46.96    45.92    53.51
03-22                 3    3    3          140.00   160.33   156.00
03-29                 1    1    1        4,291.00 4,692.00 4,590.00
04-05               268  270  292           21.97    24.42    23.79
07-11                 4    4    4        1,646.75 1,746.25 1,746.75
07-15               173  185  179           53.61    53.82    59.07
08-11                15   15   15          504.00   512.33   571.87

-> Boxes shipped on Mar 29: exactly 1, 1, 1 — three years in a row
-> Boxes shipped on Mar 22: exactly 3, 3, 3 — three years in a row
-> Boxes shipped on Jul 11: exactly 4, 4, 4 — three years in a row
-> RPB increases slightly each year: A built-in inflation factor (~3-5%)

Reverse-Engineering the Generator¶

The patterns are clear now. The dataset appears to have been built with the following logic:

  1. Fixed date templates per product-country-salesperson combination: the same calendar days recur every year
  2. Boxes shipped = base value ± noise (~5%): for small base values (1-6 boxes), the noise isn't enough to survive rounding, resulting in identical values across years
  3. Revenue = base value × (1 + inflation)^year ± noise: a built-in YoY uplift of ~3-5%
  4. No interaction between dimensions: salespersons have no regional territories, products have no differentiated market positions
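The logic above can be sketched as a toy generator. The template dates, base values, and noise levels below are illustrative assumptions loosely modeled on the White Choc / Canada table, not parameters recovered from the data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical template for one product-country combination:
# fixed calendar days, each with a base box count and a base RPB
template = {'03-22': 3, '03-29': 1, '07-11': 4, '01-25': 136}
base_rpb = {'03-22': 140.0, '03-29': 4300.0, '07-11': 1650.0, '01-25': 34.0}
inflation = 0.04  # assumed ~4% built-in YoY uplift

rows = []
for year_idx, year in enumerate([2022, 2023, 2024]):
    for day, base_boxes in template.items():
        # ±5% multiplicative noise; rounding erases it for tiny base values
        boxes = round(base_boxes * rng.uniform(0.95, 1.05))
        rpb = base_rpb[day] * (1 + inflation) ** year_idx * rng.uniform(0.98, 1.02)
        rows.append((year, day, boxes, round(rpb, 2)))

for r in rows:
    print(r)
# For base values of 1-4 boxes, ±5% noise never survives rounding,
# yielding the identical counts seen three years running; the large
# base (136) varies slightly, just like 136 / 132 / 142 in the data.
```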

6. The Volume-Price Distortion¶

One last piece of the puzzle: I noticed that low-volume shipments (<20 boxes) produce extreme RPB values. Is this a real pricing phenomenon or a generator artifact?

# Structural break: At what volume does RPB stabilize?
cutoffs = range(5, 105, 5)
results = []
for cutoff in cutoffs:
    subset = df[df['boxes_shipped'] >= cutoff]
    cov = subset['revenue_per_box'].std() / subset['revenue_per_box'].mean()
    results.append({'min_boxes': cutoff, 'n': len(subset), 'rpb_mean': subset['revenue_per_box'].mean().round(2), 'rpb_cov': cov.round(4)})

stability = pd.DataFrame(results)

fig, ax1 = plt.subplots(figsize=(10, 5))
ax2 = ax1.twinx()

ax1.plot(stability['min_boxes'], stability['rpb_cov'], 'o-', color='#e74c3c', linewidth=2, label='CoV (pricing dispersion)')
ax2.plot(stability['min_boxes'], stability['rpb_mean'], 's-', color='#2196F3', linewidth=2, label='Mean RPB ($)')

ax1.axvspan(5, 30, alpha=0.1, color='red', label='Unstable zone')
ax1.axvline(x=30, color='gray', linestyle='--', alpha=0.5)

ax1.set_xlabel('Minimum Boxes Shipped (Cutoff)', fontsize=12)
ax1.set_ylabel('Coefficient of Variation', color='#e74c3c', fontsize=12)
ax2.set_ylabel('Mean RPB ($)', color='#2196F3', fontsize=12)
ax1.set_title('RPB Stability by Volume Cutoff — Where Does Pricing Stabilize?', fontsize=13)

lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper right')
ax1.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Volume tier comparison
df['volume_tier'] = pd.cut(
    df['boxes_shipped'],
    bins=[0, 20, 50, 100, 200, 400, float('inf')],
    labels=['Micro (<20)', 'Small (20-50)', 'Medium (50-100)', 'Large (100-200)', 'Bulk (200-400)', 'Mega (400+)']
)

volume_stats = df.groupby('volume_tier', observed=True)['revenue_per_box'].agg(['mean', 'std', 'count'])
volume_stats['cov'] = (volume_stats['std'] / volume_stats['mean']).round(2)
print('=== RPB by Volume Tier ===')
print(volume_stats)
print(f'\n-> Micro tier: Mean RPB ${volume_stats.loc["Micro (<20)", "mean"]:.0f} vs. Mega tier: ${volume_stats.loc["Mega (400+)", "mean"]:.0f} — a {volume_stats.loc["Micro (<20)", "mean"]/volume_stats.loc["Mega (400+)", "mean"]:.0f}x spread')
=== RPB by Volume Tier ===
                  mean    std  count  cov
volume_tier                              
Micro (<20)     876.18 825.36    204 0.94
Small (20-50)   190.44 146.95    360 0.77
Medium (50-100)  83.23  68.93    638 0.83
Large (100-200)  41.76  31.94   1061 0.76
Bulk (200-400)   22.08  17.83    821 0.81
Mega (400+)      12.98   9.53    198 0.73

-> Micro tier: Mean RPB $876 vs. Mega tier: $13 — a 68x spread

The structural break confirms it: below ~30 boxes, pricing behaves fundamentally differently. This isn't a volume discount effect.
An RPB spread of $876 (Micro) vs. $13 (Mega) can't be explained by any realistic business model.
There are no metadata fields (order type, sales channel, packaging tier) that could justify these outliers.
In a real dataset, you'd expect that context to exist: here, it doesn't, because the generator simply didn't model it.
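A quick simulation shows why a generator that draws `sales_amount` and `boxes_shipped` independently (my hypothesis about how this data was made, not a documented fact) reproduces exactly this tier pattern: the ratio blows up whenever the denominator is small, no pricing model required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent draws, roughly matching the dataset's scales
sales = rng.uniform(1, 15_000, size=10_000)
boxes = rng.integers(1, 700, size=10_000)
rpb = sales / boxes

micro = rpb[boxes < 20]
mega = rpb[boxes >= 400]

print(f'Mean RPB, micro tier (<20 boxes): ${micro.mean():,.0f}')
print(f'Mean RPB, mega tier (400+ boxes): ${mega.mean():,.0f}')
# A huge micro-vs-mega spread emerges from independence alone --
# the same artifact we see in the real dataset's volume tiers.
```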

7. Verdict¶

findings = pd.DataFrame({
    'Finding': [
        'RPB spread $0.01 – $4,692 for chocolate',
        'CoV > 1.4 across all 22 products',
        'No Pareto effect (77% of products needed for 80% revenue)',
        'All 25 salespersons active in all 6 countries',
        'Q4 completely missing (Sep–Dec)',
        'Identical transaction counts per month across 3 years',
        'Deterministic date templates with noise overlay',
        'Built-in YoY inflation factor (~3-5%)'
    ],
    'Severity': ['Critical', 'Critical', 'High', 'High', 'High', 'Critical', 'Critical', 'Medium'],
    'Category': [
        'Pricing', 'Pricing', 'Concentration', 'Sales Structure',
        'Completeness', 'Structure', 'Structure', 'Structure'
    ]
})

print('=' * 80)
print('  VERDICT: Synthetically Generated Dataset')
print('=' * 80)
print()
print(findings.to_string(index=False))
print()
print('-' * 80)
print('The data shows none of the characteristics of a real business.')
print('Instead, it exhibits all hallmarks of a rule-based generator:')
print('fixed templates, deterministic noise, missing dimensional interactions,')
print('and a suspiciously uniform distribution across all segments.')
print('-' * 80)
================================================================================
  VERDICT: Synthetically Generated Dataset
================================================================================

                                                  Finding Severity        Category
                  RPB spread $0.01 – $4,692 for chocolate Critical         Pricing
                         CoV > 1.4 across all 22 products Critical         Pricing
No Pareto effect (77% of products needed for 80% revenue)     High   Concentration
            All 25 salespersons active in all 6 countries     High Sales Structure
                          Q4 completely missing (Sep–Dec)     High    Completeness
    Identical transaction counts per month across 3 years Critical       Structure
          Deterministic date templates with noise overlay Critical       Structure
                    Built-in YoY inflation factor (~3-5%)   Medium       Structure

--------------------------------------------------------------------------------
The data shows none of the characteristics of a real business.
Instead, it exhibits all hallmarks of a rule-based generator:
fixed templates, deterministic noise, missing dimensional interactions,
and a suspiciously uniform distribution across all segments.
--------------------------------------------------------------------------------

Conclusion¶

What started as a simple question — "Is this real data?" — turned into a full plausibility check. Using pricing analysis, concentration checks, structural break detection, and pattern recognition, we were able to systematically demonstrate that this dataset is synthetically generated.

The key evidence:

  • Not a single product has a realistic pricing structure (all CoV > 1.4)
  • Every salesperson sells in every country — no sales territories whatsoever
  • Identical transaction counts per month across three years: statistically implausible with real data
  • Deterministic patterns (same calendar days, same base values, slight noise): the fingerprint of a generator

This doesn't mean the dataset is useless. It's perfectly fine for practicing visualizations and analysis, and I had loads of fun determining the data's nature.


Analysis performed with Python (pandas, seaborn, matplotlib). Inspired by a comment on Kaggle.