Question 1

What is a synthetic data generator for machine learning?

Accepted Answer

A synthetic data generator creates artificial datasets with controlled statistical properties. For binary classification, it produces pairs of true labels and predicted probability scores so you can explore how distribution shape, noise, and class imbalance affect evaluation metrics like precision, recall, F1, ROC AUC, and calibration — without needing real model output.
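
As a rough illustration of the idea, the sketch below produces (label, score) pairs from a handful of controls. It is not the tool's actual implementation: the function name, the parameter names (positive_ratio, separation, spread, seed) and the choice of clipped Gaussian scores are all assumptions made for the example.

```python
import random

def generate_scores(n, positive_ratio=0.5, separation=0.3, spread=0.15, seed=0):
    """Return a list of (label, score) pairs. All parameter names are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < positive_ratio else 0
        # Class means sit symmetrically around 0.5, `separation` apart.
        mean = 0.5 + separation / 2 if label == 1 else 0.5 - separation / 2
        # Clip to [0, 1] so the value can be read as a probability.
        score = min(1.0, max(0.0, rng.gauss(mean, spread)))
        rows.append((label, score))
    return rows

rows = generate_scores(1000, positive_ratio=0.2)
```

Each row can then be fed to whatever metric you want to study, with the advantage that you know exactly how the data was produced.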

Question 2

What is the difference between Gaussian and Beta score distributions?

Accepted Answer

Gaussian (normal) distributions model scores that cluster symmetrically around a class mean. Because a Gaussian is unbounded, sampled values must be clipped or otherwise constrained to the 0 to 1 range before they can be read as probabilities. Beta distributions are bounded between 0 and 1 by construction and can be skewed or U-shaped, so they better represent overconfident models that push scores toward the extremes, or underconfident models that cluster near 0.5.
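
A small sketch of the two shapes, using Python's standard random module. The specific shape parameters (0.5, 0.5 for a U-shape, 5, 5 for a peak near 0.5) and the helper names are chosen for illustration only and are not taken from the tool.

```python
import random

rng = random.Random(42)

def gaussian_score(mean, std):
    # A Gaussian is unbounded, so clip the sample to [0, 1].
    return min(1.0, max(0.0, rng.gauss(mean, std)))

def beta_score(alpha, beta):
    # Beta samples already lie in [0, 1].
    return rng.betavariate(alpha, beta)

symmetric = [gaussian_score(0.5, 0.1) for _ in range(5000)]
u_shaped = [beta_score(0.5, 0.5) for _ in range(5000)]  # mass piles up near 0 and 1
peaked = [beta_score(5.0, 5.0) for _ in range(5000)]    # mass concentrates near 0.5

def share_near_extremes(scores, margin=0.1):
    """Fraction of scores within `margin` of 0 or 1."""
    return sum(s < margin or s > 1 - margin for s in scores) / len(scores)

print(share_near_extremes(u_shaped), share_near_extremes(peaked))
```

The U-shaped sample puts a large share of its scores near 0 or 1, while the peaked sample puts almost none there.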

Question 3

How does the positive ratio affect confusion matrix metrics?

Accepted Answer

Positive ratio (class prevalence) directly controls class imbalance. At 50% the classes are balanced. At 5%, the dataset is highly imbalanced toward negatives. Imbalance inflates accuracy (at 5% prevalence, predicting all-negative already achieves 95% accuracy) and lets precision and recall diverge sharply, because even a small false positive rate on the large negative class can swamp the true positives. F1 score and ROC AUC are therefore more informative than accuracy.
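
The accuracy inflation is easy to check by hand. The sketch below scores a trivial all-negative classifier on a dataset with a 5% positive ratio; the helper name is illustrative.

```python
def accuracy_and_recall(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

labels = [1] * 50 + [0] * 950   # 5% positive ratio
preds = [0] * 1000              # trivial all-negative classifier
accuracy, recall = accuracy_and_recall(labels, preds)
print(accuracy, recall)  # 0.95 0.0
```

Accuracy looks strong even though the classifier never finds a single positive.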

Question 4

What does label noise do to model metrics?

Accepted Answer

Label noise randomly flips true labels, simulating annotation errors in training or test data. Higher noise degrades all measured metrics, especially precision and recall, because the predictions stay the same while the labels change: some true positives become false positives and some true negatives become false negatives. The calibration curve also degrades because scores no longer align with the noisy labels.
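
A minimal sketch of label flipping, assuming each label is flipped independently with a fixed probability (the tool may implement noise differently). The predictions here match the clean labels exactly, so every error counted afterwards is caused by the noise alone.

```python
import random

def flip_labels(labels, noise_rate, seed=0):
    """Flip each label independently with probability noise_rate."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]

clean = [1] * 500 + [0] * 500
preds = list(clean)  # predictions that agree with the clean labels
noisy = flip_labels(clean, noise_rate=0.1)

# Predictions are unchanged, so every flipped label shows up as an error.
false_pos = sum(1 for y, p in zip(noisy, preds) if y == 0 and p == 1)
false_neg = sum(1 for y, p in zip(noisy, preds) if y == 1 and p == 0)
flipped = sum(1 for a, b in zip(clean, noisy) if a != b)
print(flipped, false_pos, false_neg)
```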

Question 5

What is model calibration and how does the calibration curve work?

Accepted Answer

Calibration measures how well predicted probabilities match actual outcome rates. A perfectly calibrated model has a calibration curve along the diagonal: when the model predicts 70% probability, roughly 70% of those cases are truly positive. With predicted probability on the x-axis and observed positive rate on the y-axis, an overconfident model's curve is flatter than the diagonal (above it at low probabilities, below it at high ones), while an underconfident model's curve is steeper and S-shaped. ECE (Expected Calibration Error) summarises the gap numerically.
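
One common way to compute ECE for binary scores is to bin the predictions, compare the mean predicted probability with the observed positive rate in each bin, and average the gaps weighted by bin size. The sketch below assumes ten equal-width bins; the tool's binning scheme is not stated and may differ.

```python
def expected_calibration_error(labels, scores, n_bins=10):
    """Weighted average gap between mean score and positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, s))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for _, s in members) / len(members)
        positive_rate = sum(y for y, _ in members) / len(members)
        ece += len(members) / total * abs(positive_rate - mean_score)
    return ece

labels = [1] * 7 + [0] * 3                                   # 70% of cases are positive
well_calibrated = expected_calibration_error(labels, [0.7] * 10)
too_high = expected_calibration_error(labels, [0.9] * 10)
print(well_calibrated, too_high)
```

Predicting 0.7 for a group that is 70% positive gives an ECE of essentially zero, while predicting 0.9 for the same group gives an ECE of about 0.2.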

Question 6

How do I use the synthetic data with the Confusion Matrix Calculator?

Accepted Answer

Click Download and choose either 'Prediction Scores' (actual label + probability, for ROC/PR curve analysis) or 'Binary Predictions' (actual + 0/1 prediction at a given threshold, for confusion matrix entry). Then upload that CSV in the Confusion Matrix Calculator to verify your calculations against a dataset whose properties you know.
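
If you only have the scores file and want binary predictions at a different threshold, the conversion is a few lines. The sketch below is hypothetical: the column names (actual, score, predicted) and the inline sample data are assumptions, so check the header of the file you actually downloaded.

```python
import csv
import io

# Stand-in for a downloaded 'Prediction Scores' file; the header names are assumed.
scores_csv = "actual,score\n1,0.91\n0,0.35\n1,0.48\n0,0.62\n"

def to_binary_predictions(csv_text, threshold=0.5):
    """Convert label/score rows into label/0-1 prediction rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["actual", "predicted"])
    for row in reader:
        predicted = 1 if float(row["score"]) >= threshold else 0
        writer.writerow([row["actual"], predicted])
    return out.getvalue()

binary_csv = to_binary_predictions(scores_csv)
print(binary_csv)
```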

Question 7

What does the separation parameter control?

Accepted Answer

Separation is the difference between the positive class mean score and the negative class mean score. Higher separation means the model assigns distinctly higher probabilities to positive samples than to negative samples, which yields a higher ROC AUC, easier threshold selection, and higher precision and recall at most thresholds.
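
The link between separation and ROC AUC can be checked with a small experiment. The sketch below assumes Gaussian class scores with equal spread and computes AUC as the fraction of positive/negative pairs ranked correctly; the names sample and roc_auc are illustrative, and scores are left unclipped to avoid artificial ties.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sample(separation, n=300, spread=0.15, seed=0):
    rng = random.Random(seed)
    labels = [1] * n + [0] * n
    scores = [rng.gauss(0.5 + separation / 2, spread) for _ in range(n)]
    scores += [rng.gauss(0.5 - separation / 2, spread) for _ in range(n)]
    return labels, scores

auc_none = roc_auc(*sample(0.0))   # no separation: close to 0.5
auc_some = roc_auc(*sample(0.2))   # moderate separation
auc_wide = roc_auc(*sample(0.6))   # wide separation: close to 1.0
print(auc_none, auc_some, auc_wide)
```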

Question 8

What are hard examples and how do they affect metrics?

Accepted Answer

Hard examples are samples that are inherently difficult to classify: positive samples with low predicted scores and negative samples with high predicted scores. They increase the overlap between the score distributions, reducing AUC and making precision-recall trade-offs steeper. Real datasets typically contain hard examples near decision boundaries.

Question 9

What is the Brier Score?

Accepted Answer

Brier Score is the mean squared error between predicted probabilities and true binary labels. It ranges from 0 (perfect) to 1 (worst). Unlike accuracy, it penalises confident wrong predictions heavily. An uninformative classifier that always predicts 0.5 scores 0.25. Brier Score reflects both discrimination and calibration.
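
The definition translates directly into code. The sketch below checks the three reference points: perfect predictions, a constant 0.5, and confidently wrong predictions.

```python
def brier_score(labels, scores):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)

labels = [1, 0, 1, 0]
print(brier_score(labels, [1.0, 0.0, 1.0, 0.0]))  # 0.0  (perfect)
print(brier_score(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.25 (always predicts 0.5)
print(brier_score(labels, [0.0, 1.0, 0.0, 1.0]))  # 1.0  (confidently wrong)
```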

Question 10

What does the random seed control?

Accepted Answer

The seed initialises the random number generator so you can reproduce exactly the same dataset. Changing the seed produces a different random sample with the same statistical properties. Use 'Randomize seed' to get a fresh sample each time you click Regenerate, or fix the seed to share a reproducible configuration with others via Copy Config.
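
The same behaviour can be seen with Python's standard random module, used here only as an analogy since the tool's own random number generator is not specified. The function name sample_scores is illustrative.

```python
import random

def sample_scores(seed, n=5):
    """Draw n pseudo-random scores from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [round(rng.random(), 4) for _ in range(n)]

same_a = sample_scores(7)
same_b = sample_scores(7)
other = sample_scores(8)

print(same_a == same_b)  # True: same seed, identical sample
print(same_a == other)   # False: different seed, different sample
```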

Question 11

What is distribution shift and why does it matter?

Accepted Answer

Distribution shift is the mismatch between training and deployment data, which the generator lets you simulate. Covariate shift moves the input distribution (score means shift), while prior shift changes class prevalence at inference time. Both can degrade real-world metrics relative to held-out test set performance, which is why models often underperform in production compared to evaluation benchmarks.
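
Prior shift in particular can be worked through with simple arithmetic. Assuming the true positive rate and false positive rate stay fixed while only prevalence changes, precision follows from Bayes' rule. The rates of 0.8 and 0.1 below are made-up example values.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR and FPR at a given class prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.8, 0.1
balanced = precision_at_prevalence(tpr, fpr, 0.50)  # about 0.89
shifted = precision_at_prevalence(tpr, fpr, 0.05)   # about 0.30
print(round(balanced, 2), round(shifted, 2))
```

The classifier itself has not changed, yet its precision falls from roughly 0.89 to roughly 0.30 when positives become rare.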

Question 12

How does the threshold slider affect precision and recall?

Accepted Answer

The threshold determines which predicted probabilities are classified as positive. Lowering the threshold increases recall (more true positives caught) and usually decreases precision (more false positives). Raising it usually does the opposite. The score distribution histogram shows this trade-off visually: the threshold line cuts through the two class score distributions, and everything to its right is predicted positive.
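
The trade-off can be traced on a tiny hand-made dataset. The sketch below assumes a score at or above the threshold counts as a positive prediction; whether the tool uses "at or above" or "strictly above" is not stated.

```python
def precision_recall(labels, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

for threshold in (0.75, 0.50, 0.25):
    precision, recall = precision_recall(labels, scores, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
# 0.75 1.0 0.5
# 0.5 0.75 0.75
# 0.25 0.667 1.0
```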
