Other information

About this Synthetic Data Generator

This free synthetic data generator creates binary classification datasets with fully controllable statistical properties — class separation, noise, imbalance, calibration, and distribution shape. It produces pairs of true labels and predicted probability scores so you can explore how model quality and data characteristics affect precision, recall, F1 score, ROC AUC, calibration, and confusion matrix metrics, without needing real model output.

Use it to build intuition before a job interview, experiment with the precision-recall tradeoff in a live classroom, pressure-test your understanding of calibration, or generate a labelled CSV to import into the Confusion Matrix Calculator. All generation runs client-side — nothing is uploaded to any server.

What You Can Simulate

Class Separation

Controls the gap between the positive and negative score distributions. High separation means a model can easily distinguish classes — ROC AUC and F1 are high even with a simple threshold. Low separation creates heavy overlap, making any classification threshold a compromise. This is the most direct lever for simulating strong vs. weak model discriminative ability.
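As a rough illustration of the separation lever, the sketch below draws scores from two unit-variance Gaussians and measures ROC AUC at a small and a large gap. The function names, the unit variance, and the choice to leave scores as raw unbounded values are assumptions made for this example, not a description of how the generator works internally.

```python
import random

def make_scores(n, separation, positive_ratio=0.5, seed=0):
    """Draw (label, score) pairs from two unit-variance Gaussians
    whose means sit `separation` apart (hypothetical parameterisation)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < positive_ratio else 0
        mean = separation / 2 if y == 1 else -separation / 2
        data.append((y, rng.gauss(mean, 1.0)))
    return data

def roc_auc(data):
    """AUC as the probability that a random positive outscores
    a random negative (ties count as half a win)."""
    pos = [s for y, s in data if y == 1]
    neg = [s for y, s in data if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

auc_low = roc_auc(make_scores(1000, separation=0.5))
auc_high = roc_auc(make_scores(1000, separation=3.0))
print(f"separation 0.5 -> AUC {auc_low:.3f}")
print(f"separation 3.0 -> AUC {auc_high:.3f}")
```

Because AUC depends only on the ranking of scores, the raw Gaussian values do not need to be squashed into [0, 1] for this comparison.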

Score Distribution: Gaussian vs. Beta

Gaussian (normal) distributions model scores that cluster symmetrically around a mean — a reasonable starting point for many classifiers. Beta distributions are bounded naturally within [0, 1] and can be skewed or U-shaped, better representing overconfident models that push predictions toward the extremes, or underconfident models that hedge near 0.5. Switching between them shows how distribution shape alone affects calibration curves and threshold sensitivity.
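The difference in shape can be seen with Python's standard random module. The parameters below (a Gaussian centred at 0.7, Beta(0.5, 0.5) for a U shape, Beta(8, 8) for a hump around 0.5) are illustrative choices for this sketch, not the generator's actual settings.

```python
import random

rng = random.Random(42)
n = 5000

# Gaussian scores are unbounded, so some fall outside [0, 1] and need clipping.
gauss_raw = [rng.gauss(0.7, 0.25) for _ in range(n)]
gauss_clipped = [min(1.0, max(0.0, s)) for s in gauss_raw]
out_of_range = sum(s < 0 or s > 1 for s in gauss_raw) / n

# Beta(0.5, 0.5) is U-shaped: mass piles up near 0 and 1 (an overconfident look).
beta_u = [rng.betavariate(0.5, 0.5) for _ in range(n)]
# Beta(8, 8) is a hump around 0.5 (an underconfident look).
beta_mid = [rng.betavariate(8, 8) for _ in range(n)]

def share_extreme(scores):
    """Fraction of scores below 0.1 or above 0.9."""
    return sum(s < 0.1 or s > 0.9 for s in scores) / len(scores)

print(f"Gaussian draws outside [0, 1]: {out_of_range:.1%}")
print(f"Beta(0.5, 0.5) extreme share:  {share_extreme(beta_u):.1%}")
print(f"Beta(8, 8) extreme share:      {share_extreme(beta_mid):.1%}")
```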

Class Imbalance (Positive Ratio)

Sets the fraction of samples that are genuinely positive. At 0.5 the dataset is balanced. At 0.05 only 1 in 20 samples is positive — mirroring fraud detection, rare disease screening, or anomaly detection. On highly imbalanced data, accuracy stays suspiciously high even as recall collapses. Watching MCC and balanced accuracy diverge from accuracy as you lower this slider is one of the most instructive things this tool can show.
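A small worked example makes the divergence concrete. The confusion matrix counts below are hypothetical: 1000 samples, 50 actual positives (5%), and a model that finds only 5 of them.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, recall, balanced accuracy and MCC from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    balanced_accuracy = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "balanced_accuracy": balanced_accuracy, "mcc": mcc}

# Hypothetical counts: 50 actual positives out of 1000, only 5 detected.
m = imbalance_metrics(tp=5, fp=5, fn=45, tn=945)
for name, value in m.items():
    print(f"{name:18s} {value:.3f}")
```

Accuracy comes out at 0.95 while recall is 0.10, balanced accuracy is about 0.55, and MCC is about 0.21, which is the gap the slider is meant to expose.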

Label Noise

Randomly flips a fraction of true labels, simulating annotation errors or ground truth uncertainty. Even 5–10% label noise can degrade precision and recall noticeably, especially on imbalanced data, while accuracy falls by no more than roughly the flip rate. This reveals why accuracy is a poor proxy for model quality on noisy real-world data, and why metrics that are sensitive to false positives and false negatives (F1, MCC) are preferred.
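One way to see the effect in isolation is to score a hypothetical perfect model against labels that have been flipped at random. The sketch assumes symmetric, independent flips and a 5% positive class; the generator's own noise model may differ.

```python
import random

def flip_labels(labels, noise_rate, rng):
    """Flip each label independently with probability `noise_rate`."""
    return [1 - y if rng.random() < noise_rate else y for y in labels]

def evaluate(labels, preds):
    """Return (accuracy, precision, recall) for 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

rng = random.Random(0)
clean = [1 if rng.random() < 0.05 else 0 for _ in range(20000)]
preds = list(clean)                    # hypothetical model that is always right
noisy = flip_labels(clean, 0.05, rng)  # 5% of the labels are now wrong

acc, prec, rec = evaluate(noisy, preds)
print(f"accuracy {acc:.3f}  precision {prec:.3f}  recall {rec:.3f}")
```

In this setup the flipped negatives roughly double the number of samples labelled positive, so measured recall lands near 0.5 even though the model made no mistakes, while accuracy stays near 0.95.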

Calibration (Overconfident / Underconfident)

Model calibration measures whether predicted probabilities match actual outcome rates. An overconfident model pushes scores toward 0 and 1, making it seem decisive but misleading downstream systems that rely on probability estimates. An underconfident model compresses scores toward 0.5. Both are visible in the calibration curve and quantified by ECE (Expected Calibration Error) and Brier Score.
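A common way to mimic both failure modes is to scale the log-odds of a probability, which is the idea behind temperature scaling. Whether the generator uses this exact transform is an assumption; the function name and the factors 3.0 and 0.3 are illustrative.

```python
import math

def scale_log_odds(p, k):
    """Scale the log-odds of p by k.
    k > 1 pushes scores toward 0 and 1 (overconfident);
    k < 1 pulls them toward 0.5 (underconfident)."""
    eps = 1e-9  # keep the log-odds finite at exactly 0 or 1
    p = min(1 - eps, max(eps, p))
    return 1 / (1 + math.exp(-k * math.log(p / (1 - p))))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    over = scale_log_odds(p, 3.0)
    under = scale_log_odds(p, 0.3)
    print(f"{p:.1f} -> overconfident {over:.3f}, underconfident {under:.3f}")
```

The ordering of scores is unchanged by this transform, so ROC AUC stays the same while ECE and Brier Score move.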

Outliers and Hard Examples

Outliers assign completely random scores to a fraction of samples — simulating cases a model gets catastrophically wrong. Hard examples pull scores toward the decision boundary (0.5), simulating inherently ambiguous inputs. Both increase irreducible error: no threshold choice can fully eliminate their FP and FN contributions. Real production models always contain both; these sliders let you see exactly how much each type of failure costs.
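The two failure types can be sketched as simple transforms on a list of scores. The function names, the uniform replacement for outliers, and the `pull` factor for hard examples are assumptions for illustration, applied here to an artificially clean dataset (0.9 for every positive, 0.1 for every negative).

```python
import random

def add_outliers(scores, fraction, rng):
    """Replace a random `fraction` of scores with uniform draws from [0, 1)."""
    return [rng.random() if rng.random() < fraction else s for s in scores]

def add_hard_examples(scores, fraction, rng, pull=0.8):
    """Move a random `fraction` of scores `pull` of the way toward 0.5."""
    return [s + pull * (0.5 - s) if rng.random() < fraction else s for s in scores]

def error_rate(labels, scores, threshold=0.5):
    """Share of samples whose thresholded prediction disagrees with the label."""
    return sum((s >= threshold) != (y == 1) for y, s in zip(labels, scores)) / len(labels)

def mean_margin(scores):
    """Average distance of scores from the 0.5 decision boundary."""
    return sum(abs(s - 0.5) for s in scores) / len(scores)

rng = random.Random(1)
labels = [i % 2 for i in range(10000)]
clean = [0.9 if y == 1 else 0.1 for y in labels]
with_outliers = add_outliers(clean, 0.2, rng)
with_hard = add_hard_examples(clean, 0.2, rng)

print(f"clean:    error {error_rate(labels, clean):.3f}, margin {mean_margin(clean):.3f}")
print(f"outliers: error {error_rate(labels, with_outliers):.3f}")
print(f"hard:     error {error_rate(labels, with_hard):.3f}, margin {mean_margin(with_hard):.3f}")
```

With 20% outliers, about half of the replaced scores land on the wrong side of 0.5, so the error rate rises to roughly 10%. The hard examples in this clean setup are not misclassified yet, but their margin shrinks from 0.4 to 0.08, so any added noise or threshold shift tips them over.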

Frequently Asked Questions

What is a synthetic dataset and why generate one for ML?

A synthetic dataset is generated from a statistical recipe rather than collected from a real system. Real datasets are messy: you rarely control class balance, noise level, or ground truth quality. Synthetic data flips that: you define the exact statistical properties, then observe how metrics respond. If you've ever wondered why F1 drops when you add noise but accuracy barely moves, or why a model with 0.9 ROC AUC can still have terrible precision on a dataset with a 5% positive class, generating synthetic data with known properties is the fastest way to build that intuition.

How do I simulate a highly imbalanced dataset?

Set Positive Ratio to 0.05 and observe: accuracy stays near 95% even as recall falls toward zero. This mirrors what happens in fraud detection, rare disease screening, or network intrusion detection. To see the full effect, compare accuracy vs. MCC and balanced accuracy at this setting — the gap reveals exactly how much accuracy is hiding. Then push the threshold up and watch FP shrink as FN explodes.

What does the decision threshold control?

The threshold is the score cutoff: samples above it are classified positive, below it negative. The score distribution histogram shows this visually — drag the threshold line and watch the confusion matrix update in real time. Moving left increases recall but reduces precision; moving right does the opposite. This is the precision-recall tradeoff made tangible. The optimal threshold depends entirely on the relative cost of false positives vs. false negatives in your use case.
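The tradeoff can be traced by hand on a tiny dataset. The twelve (label, score) pairs below are made up for illustration, and the sketch treats a score equal to the threshold as a positive prediction, which is one common convention rather than a documented behaviour of the tool.

```python
def confusion(data, threshold):
    """Return (tp, fp, fn, tn), treating score >= threshold as a positive prediction."""
    tp = sum(1 for y, s in data if y == 1 and s >= threshold)
    fp = sum(1 for y, s in data if y == 0 and s >= threshold)
    fn = sum(1 for y, s in data if y == 1 and s < threshold)
    tn = sum(1 for y, s in data if y == 0 and s < threshold)
    return tp, fp, fn, tn

def precision_recall(data, threshold):
    tp, fp, fn, _ = confusion(data, threshold)
    # With no positive predictions, precision is reported as 1.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical (label, score) pairs: 5 positives, 7 negatives.
data = [(1, 0.95), (1, 0.85), (1, 0.70), (1, 0.55), (1, 0.35),
        (0, 0.80), (0, 0.60), (0, 0.45), (0, 0.30), (0, 0.20),
        (0, 0.10), (0, 0.05)]

for t in (0.3, 0.5, 0.75, 0.9):
    p, r = precision_recall(data, t)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

At 0.3 every positive is caught (recall 1.00) but four negatives slip in; at 0.9 only one sample is flagged and it is correct (precision 1.00, recall 0.20).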

What is model calibration and how does the calibration curve work?

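A calibrated model is one where a predicted probability of 70% means roughly 70% of those samples are actually positive. The calibration curve plots predicted probability bins on the x-axis against actual positive rate on the y-axis. A perfectly calibrated model hugs the diagonal. Overconfident models produce a curve flatter than the diagonal: above it at low predicted probabilities and below it at high ones, because extreme scores overstate how certain the outcome is. Underconfident models produce a steeper, S-shaped curve: below the diagonal at low probabilities and above it at high ones. ECE (Expected Calibration Error) quantifies the average gap, weighted by how many samples fall in each bin. Calibration matters most when probability estimates are used directly: risk scoring, cost-sensitive decisions, or probability thresholding in production.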
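The binning behind the curve and ECE can be sketched in a few lines. The example assumes ten equal-width bins and builds a calibrated dataset by drawing each label with probability equal to its score; the distorted variant scales the log-odds by a factor of 3, which is an illustrative choice and not necessarily what the generator does.

```python
import math
import random

def scale_log_odds(p, k):
    """Push p toward 0 or 1 (k > 1) by scaling its log-odds."""
    eps = 1e-9
    p = min(1 - eps, max(eps, p))
    return 1 / (1 + math.exp(-k * math.log(p / (1 - p))))

def reliability_bins(labels, scores, n_bins=10):
    """Return (mean predicted, observed positive rate, count) per non-empty bin."""
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        acc[b][0] += s
        acc[b][1] += y
        acc[b][2] += 1
    return [(sp / c, sy / c, c) for sp, sy, c in acc if c > 0]

def ece(labels, scores, n_bins=10):
    """Count-weighted average of |observed rate - mean predicted| over bins."""
    n = len(labels)
    return sum(c / n * abs(obs - pred)
               for pred, obs, c in reliability_bins(labels, scores, n_bins))

rng = random.Random(7)
calibrated = [rng.random() for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in calibrated]
distorted = [scale_log_odds(p, 3.0) for p in calibrated]

print(f"ECE, calibrated scores: {ece(labels, calibrated):.3f}")
print(f"ECE, distorted scores:  {ece(labels, distorted):.3f}")
for pred, obs, count in reliability_bins(labels, distorted):
    print(f"predicted {pred:.2f}  observed {obs:.2f}  n={count}")
```

Each printed row is one point on the calibration curve, so the per-bin gap between the predicted and observed columns is what ECE averages.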

What is the difference between label noise and prediction noise?

Label noise corrupts the ground truth — some labels are simply wrong, simulating annotation errors. Prediction noise adds jitter to the model's output scores without changing the labels. Label noise degrades every metric because the evaluation target itself is unreliable. Prediction noise makes the model seem less decisive — scores drift toward 0.5 — hurting calibration and sharpness without breaking the labels themselves. Both are common in production; the sliders let you isolate each effect.

How do I use the generated data with the Confusion Matrix Calculator?

Click Download and choose Prediction Scores (actual label + probability score, 0–1) to unlock the ROC curve, Precision-Recall curve, and Threshold Tuner in the Confusion Matrix Calculator. Choose Binary Predictions (actual + 0/1 predicted at your chosen threshold) if you only need the basic confusion matrix. Upload the CSV in the Confusion Matrix Calculator to verify your understanding of how the dataset's properties translate into real metrics.

What is the Brier Score?

Brier Score is the mean squared error between predicted probabilities and true binary labels, ranging from 0 (perfect) to 1 (worst). Unlike accuracy, it penalises confident wrong predictions heavily: saying 0.99 for a negative costs about 0.98, while hedging at 0.55 costs about 0.30. A model that always predicts 0.5 scores 0.25 regardless of class balance, which makes 0.25 a useful reference point for an uninformative model. Because it captures both discrimination and calibration in one number, it is preferred in risk-sensitive applications where probability estimates matter, not just the binary decision.
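The definition is short enough to compute directly. The helper name below is a choice made for this sketch, and the inputs are toy values picked to show the size of the penalty.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# A confident wrong prediction versus a hedged wrong prediction on one negative.
confident_wrong = brier_score([0], [0.99])
hedged_wrong = brier_score([0], [0.55])

# Always predicting 0.5, on a balanced and on an imbalanced toy dataset.
constant_balanced = brier_score([0, 1, 0, 1], [0.5] * 4)
constant_imbalanced = brier_score([0, 0, 0, 1], [0.5] * 4)

print(f"confident wrong: {confident_wrong:.4f}")
print(f"hedged wrong:    {hedged_wrong:.4f}")
print(f"constant 0.5:    {constant_balanced:.2f} (balanced), {constant_imbalanced:.2f} (imbalanced)")
```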

What does the random seed control?

The seed pins the random number generator so the same configuration always produces the same dataset, which is useful for reproducible experiments and for sharing setups with others via Copy Config. Change the seed to check that a pattern holds across multiple samples: if an observation repeats across five different seeds, it is probably a real property of the configuration rather than a fluke. If it only appears on one seed, treat it as random variation, not signal.
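The same idea applies to any seeded generator. The sketch below uses Python's random.Random with a hypothetical `generate` helper and a 30% positive ratio to show that one seed reproduces a dataset exactly, while several seeds reveal how much a statistic wobbles from sample to sample.

```python
import random

def generate(seed, n=2000, positive_ratio=0.3):
    """Hypothetical label generator: the same seed always yields the same list."""
    rng = random.Random(seed)
    return [1 if rng.random() < positive_ratio else 0 for _ in range(n)]

same_a = generate(seed=123)
same_b = generate(seed=123)
print("identical with the same seed:", same_a == same_b)

# The observed positive rate across five seeds shows the sampling variation.
observed = [sum(generate(seed)) / 2000 for seed in range(5)]
print("positive rate per seed:", [f"{r:.3f}" for r in observed])
print(f"spread across seeds: {max(observed) - min(observed):.3f}")
```

A difference between two configurations that is smaller than this seed-to-seed spread is not yet evidence of a real effect.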
