What is a good F1 score?

A good F1 score depends on the domain. As a rough rule of thumb, F1 > 0.7 is often considered good, > 0.8 very good, and > 0.9 excellent. For critical applications such as medical diagnosis, even 0.99 may be insufficient.

When should I prioritize recall over precision?

Prioritize recall when false negatives are costly — cancer screening, fraud detection, safety systems. Missing a real positive case is worse than triggering a false alarm.

What is ROC AUC?

ROC AUC (Area Under the Receiver Operating Characteristic Curve) measures a classifier's ability to distinguish between classes across all thresholds. An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect classifier.

What is a false positive?

A false positive (Type I error) is when a model predicts positive but the true label is negative. Example: a spam filter marking a legitimate email as spam.

Why are precision-recall curves important?

PR curves are more informative than ROC curves for imbalanced datasets because they focus on the minority positive class. A high ROC AUC can coexist with poor PR performance when negatives dominate.

Confusion Matrix Calculator — Precision, Recall, F1 Score & ROC Visualizer

Why is accuracy misleading on imbalanced datasets?

A classifier that predicts all negatives on a 99% negative dataset achieves 99% accuracy while being completely useless. Precision, recall, F1, and MCC are more reliable for imbalanced data.

About this Confusion Matrix Calculator

This free interactive confusion matrix calculator computes precision, recall, F1 score, specificity, Matthews Correlation Coefficient (MCC), balanced accuracy, ROC AUC, Precision-Recall AUC, and 10+ additional classification metrics directly in your browser. Enter TP, FP, FN, TN values manually, choose a preset scenario, or upload a CSV of prediction scores to visualise ROC curves, PR curves, and threshold tuning — fully client-side, no data sent to any server.

Whether you're a student learning classification evaluation for the first time or a practitioner pressure-testing a production model, the calculator covers the full metric landscape, from simple accuracy and F1 to metrics like MCC and balanced accuracy that hold up on real-world imbalanced datasets.

How to Use

Enter your matrix — type TP, FP, FN, TN directly into the cells, or pick a preset (Balanced Dataset, High Precision, Highly Imbalanced, etc.) to start from a realistic scenario.
Upload prediction scores (optional) — upload a CSV with actual and score columns to unlock the ROC curve, Precision-Recall curve, and Threshold Tuner with your real model output.
Tune the threshold — drag the threshold slider or enter a value to see how precision, recall, and F1 shift at different operating points. The confusion matrix updates live.

Classification Metrics Reference

Precision

Precision = TP / (TP + FP). Of all the cases the model predicted as positive, what fraction were actually positive? High precision means few false alarms. Prioritize precision in spam filters, legal document review, and content recommendations where acting on a false positive is costly.

Recall (Sensitivity / True Positive Rate)

Recall = TP / (TP + FN). Of all actual positive cases, what fraction did the model find? Also called sensitivity or hit rate. High recall means few missed positives. Prioritize recall in cancer screening, fraud detection, and safety-critical systems where missing a real case is dangerous.

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall). F1 is the harmonic mean of precision and recall. Unlike the arithmetic mean, the harmonic mean heavily penalizes a large gap between the two: a model with 100% precision and 0% recall scores F1 = 0, whereas the arithmetic mean would report 50%. Use F1 when you need a single number balancing both concerns on potentially imbalanced data.
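The three formulas above can be checked with a few lines of Python. This is a minimal sketch, not the calculator's own code: the function name and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Note that F1 (0.727) sits below the arithmetic mean of precision and recall (0.733), and the gap widens as the two values move further apart.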

Matthews Correlation Coefficient (MCC)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). MCC uses all four confusion matrix cells and produces a value from −1 (perfectly wrong) to +1 (perfect), with 0 meaning no better than random chance. Unlike accuracy and F1, MCC cannot be inflated by predicting the majority class. It is often recommended as the most reliable single metric for binary classification, especially on imbalanced datasets where accuracy becomes meaningless.
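As an illustration, MCC can be computed from the four cells with the standard formula MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The sketch below is an assumption-laden example, not the calculator's implementation: the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention rather than part of the definition.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


print(mcc(tp=50, fp=0, fn=0, tn=50))    # perfect classifier -> 1.0
print(mcc(tp=0, fp=50, fn=50, tn=0))    # every prediction wrong -> -1.0
print(mcc(tp=0, fp=0, fn=10, tn=990))   # always predicts negative -> 0.0
```

The last call shows the property described above: a majority-class predictor on a 99% negative dataset scores 0 rather than something close to 1.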

Specificity (True Negative Rate)

Specificity = TN / (TN + FP). Of all actual negative cases, how many were correctly identified? It is the recall for the negative class. High specificity means few false alarms. The ROC curve plots recall (TPR) on the y-axis against 1 − specificity (FPR) on the x-axis as the threshold sweeps from 0 to 1.

ROC AUC and Precision-Recall AUC

ROC AUC summarises discriminative ability across all thresholds: 0.5 = random classifier, 1.0 = perfect. It equals the probability the model ranks a random positive higher than a random negative. PR-AUC is more informative on imbalanced datasets — it focuses entirely on the positive class and ignores true negatives. A high ROC AUC can coexist with poor PR performance when the positive class is rare. Upload prediction scores to compute both from your model's real output.
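The ranking interpretation of ROC AUC can be computed directly by comparing every positive score with every negative score. The sketch below is for illustration only: the function name and the six-sample dataset are invented, ties are assumed to count as half a win, and the quadratic loop is only sensible for small inputs.

```python
def roc_auc_pairwise(labels: list[int], scores: list[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A tie between a positive and a negative counts as half a correct ranking.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 3 positives and 3 negatives with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(round(roc_auc_pairwise(labels, scores), 3))  # 8 of 9 pairs correct -> 0.889
```

Only one pair is ranked incorrectly here (the positive at 0.4 sits below the negative at 0.7), which gives 8/9.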

Frequently Asked Questions

What is a confusion matrix?

A confusion matrix is a table that breaks down a binary classifier's predictions into four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It shows exactly where the model succeeds and where it fails — something a single accuracy number cannot reveal. Every classification metric — precision, recall, F1, MCC, specificity, balanced accuracy — is derived directly from these four cells.

Why is accuracy misleading on imbalanced datasets?

If 99% of transactions are not fraud, a model that always predicts "not fraud" scores 99% accuracy while never catching a single fraudulent case. Accuracy rewards the majority class and hides complete failure on the minority. Whenever one class exceeds roughly 80% of the data, treat accuracy as suspect and switch to MCC, balanced accuracy, or F1 — metrics that account for the distribution of both classes.
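To make the fraud example concrete, the sketch below assumes 10,000 transactions of which 100 are fraudulent (hypothetical numbers) and a model that always predicts "not fraud". The function names are illustrative, and balanced accuracy is taken as the mean of recall and specificity.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of recall (positive class) and specificity (negative class)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall + specificity) / 2


# Always predicting "not fraud": no positives predicted, all 100 frauds missed.
tp, fp, fn, tn = 0, 0, 100, 9900
print(accuracy(tp, fp, fn, tn))           # 0.99
print(balanced_accuracy(tp, fp, fn, tn))  # 0.5
```

Accuracy reports 99%, while balanced accuracy drops to 0.5, the same value a coin flip would earn.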

What is the precision-recall tradeoff?

Precision and recall pull in opposite directions. Lowering the classification threshold makes the model predict positive more often: recall rises or stays the same (more true positives caught), while precision usually falls (more false positives). Raising the threshold does the opposite. There is no free lunch: the right balance depends entirely on the relative cost of false positives vs. false negatives in your application.

When should I use PR curves instead of ROC curves?

Use ROC-AUC when classes are reasonably balanced and you want a threshold-independent summary of discriminative ability. Use PR-AUC when the positive class is rare (under ~10%), false negatives are costly, or you are primarily concerned with performance on the minority class. On highly imbalanced data, a strong ROC score can mask poor PR performance — always check both when prevalence is low.
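The gap between the two summaries can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the 1,010-sample ranking is invented (10 positives, 1,000 negatives, with 50 negatives scored above every positive), all scores are distinct, and PR-AUC is approximated by average precision, which is one common estimator rather than the only one.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: list[int], scores: list[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / hits if hits else 0.0


# Synthetic ranking: 50 negatives on top, then 10 positives, then 950 negatives.
top_negatives = [0.99 - 0.001 * i for i in range(50)]
positives = [0.90 - 0.001 * i for i in range(10)]
other_negatives = [0.80 - 0.0005 * i for i in range(950)]
scores = top_negatives + positives + other_negatives
labels = [0] * 50 + [1] * 10 + [0] * 950

print(f"ROC AUC           = {roc_auc(labels, scores):.3f}")            # 0.950
print(f"average precision = {average_precision(labels, scores):.3f}")  # 0.097
```

Each positive outranks 950 of the 1,000 negatives, so ROC AUC is 0.95, yet anyone acting on the top-ranked predictions would see mostly false positives, which the average precision of about 0.10 reflects.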

What is the best single metric for binary classification?

MCC is often recommended as the single metric of choice because it uses all four confusion matrix cells, is robust to class imbalance, and cannot be inflated by predicting the majority class. Balanced accuracy is a simpler alternative with similar properties. F1 remains widely used in NLP and retrieval tasks. No single metric tells the whole story, so always back up any summary metric with the full confusion matrix.

What is a classification threshold and how does it affect metrics?

Most classifiers output a probability score between 0 and 1; the threshold is the cutoff above which a sample is classified as positive. The default of 0.5 is often not optimal, particularly on imbalanced data. Lowering the threshold increases recall (catches more positives) and typically decreases precision (more false alarms); raising it does the opposite. The Threshold Tuner in this calculator requires uploaded prediction scores and shows the full tradeoff curve.
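The effect of moving the threshold can be sketched on a handful of scored samples. The code below is a hypothetical example and not the Threshold Tuner's implementation: the labels and scores are invented, and a score equal to the threshold is assumed to count as positive.

```python
def confusion_at_threshold(
    labels: list[int], scores: list[float], threshold: float
) -> tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold  # assumption: ties go to positive
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical data: 4 positives and 4 negatives.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.10]

for threshold in (0.75, 0.50, 0.25):
    tp, fp, fn, tn = confusion_at_threshold(labels, scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"t={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
# t=0.75  precision=1.000  recall=0.500
# t=0.50  precision=0.750  recall=0.750
# t=0.25  precision=0.571  recall=1.000
```

As the threshold drops from 0.75 to 0.25, recall climbs from 0.5 to 1.0 while precision falls from 1.0 to about 0.57.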

What do TP, FP, FN, and TN mean?

True Positive (TP): model predicted positive, actual was positive — a correct detection. False Positive (FP): predicted positive, actual was negative — a false alarm (Type I error). False Negative (FN): predicted negative, actual was positive — a missed detection (Type II error). True Negative (TN): predicted negative, actual was negative — a correct rejection. In most real problems, false positives and false negatives carry very different costs, making the full matrix more useful than any single summary number.

Can a model have high accuracy but low F1 score?

Yes — this is the classic imbalanced dataset trap. A model that predicts the majority class 100% of the time achieves high accuracy but scores F1 = 0 on the minority class, because it never makes a true positive prediction. Conversely, a model with lower accuracy can achieve a much higher F1 by correctly identifying minority-class cases at the cost of some majority-class errors. This is why accuracy and F1 should always be evaluated together alongside the full confusion matrix.