This free interactive confusion matrix calculator computes precision, recall, F1 score, specificity, Matthews Correlation Coefficient (MCC), balanced accuracy, ROC AUC, Precision-Recall AUC, and 10+ additional classification metrics directly in your browser. Enter TP, FP, FN, TN values manually, choose a preset scenario, or upload a CSV of prediction scores to visualise ROC curves, PR curves, and threshold tuning — fully client-side, no data sent to any server.
Whether you're a student learning classification evaluation for the first time or a practitioner pressure-testing a production model, the calculator covers the full metric landscape, from simple accuracy and F1 to metrics like MCC and balanced accuracy that hold up better on real-world imbalanced datasets.
Upload a CSV with actual and score columns to unlock the ROC curve, Precision-Recall curve, and Threshold Tuner with your real model output.
Precision = TP / (TP + FP). Of all the cases the model predicted as positive, what fraction were actually positive? High precision means few false alarms. Prioritize precision in spam filters, legal document review, and content recommendations where acting on a false positive is costly.
Recall = TP / (TP + FN). Of all actual positive cases, what fraction did the model find? Also called sensitivity or hit rate. High recall means few missed positives. Prioritize recall in cancer screening, fraud detection, and safety-critical systems where missing a real case is dangerous.
F1 = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of precision and recall. Unlike the arithmetic mean, the harmonic mean heavily penalizes extreme imbalances: a model with 100% precision and 1% recall has an arithmetic mean of about 50% but an F1 of only about 0.02, and F1 falls to 0 when the model finds no true positives. Use F1 when you need a single number balancing both concerns on potentially imbalanced data.
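The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code; the function name and the example counts are hypothetical, and reporting an undefined ratio (zero denominator) as 0.0 is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion matrix counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts precision is 0.8 and recall is about 0.667, and F1 (about 0.727) sits below their arithmetic mean because the harmonic mean leans toward the smaller value.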
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). MCC uses all four confusion matrix cells and produces a value from −1 (every prediction wrong) to +1 (perfect), with 0 meaning no better than random chance. Unlike accuracy and F1, MCC cannot be inflated by predicting the majority class. It is often recommended as the most reliable single metric for binary classification, especially on imbalanced datasets where accuracy becomes meaningless.
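As a rough sketch, MCC can be computed from the four cells with the standard formula, (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The function name and example counts are hypothetical, and returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion matrix counts.

    Assumption: when any marginal total is zero the denominator is zero,
    and the score is reported as 0.0.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical dataset: 10 positives, 990 negatives.
perfect = mcc(tp=10, fp=0, fn=0, tn=990)           # every prediction correct
always_negative = mcc(tp=0, fp=0, fn=10, tn=990)   # majority-class predictor
inverted = mcc(tp=0, fp=990, fn=10, tn=0)          # every prediction wrong
print(perfect, always_negative, inverted)
```

The majority-class predictor scores 0 here even though its accuracy would be 99%.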
Specificity = TN / (TN + FP). Of all actual negative cases, how many were correctly identified? It is the recall for the negative class. High specificity means few false alarms. The ROC curve plots recall (TPR) on the y-axis against 1 − specificity (FPR) on the x-axis as the threshold sweeps from 0 to 1.
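To make the ROC construction concrete, here is a small sketch that sweeps a threshold over a set of scores and records (FPR, TPR) at each step. The function name and the data are hypothetical; it assumes labels are 0/1, both classes are present, and a sample is predicted positive when its score is at or above the threshold.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strict to lenient."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 0)
        tpr = tp / positives                        # recall
        specificity = (negatives - fp) / negatives  # TN / (TN + FP)
        points.append((1 - specificity, tpr))
    return points


# Hypothetical labels and scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(actual, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0) with the strictest threshold and ends at (1, 1) once every sample is predicted positive.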
ROC AUC summarises discriminative ability across all thresholds: 0.5 = random classifier, 1.0 = perfect. It equals the probability the model ranks a random positive higher than a random negative. PR-AUC is more informative on imbalanced datasets — it focuses entirely on the positive class and ignores true negatives. A high ROC AUC can coexist with poor PR performance when the positive class is rare. Upload prediction scores to compute both from your model's real output.
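The ranking interpretation of ROC AUC can be checked directly by comparing every positive score with every negative score. This brute-force sketch is for illustration only (it is quadratic in the number of samples); the function name and data are hypothetical, and ties are counted as half a win, which is the usual convention.

```python
def roc_auc_pairwise(actual, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(actual, scores) if y == 1]
    neg = [s for y, s in zip(actual, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores: 8 of the 9 positive/negative pairs are ranked correctly.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc_pairwise(actual, scores)
print(f"ROC AUC = {auc:.3f}")
```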
A confusion matrix is a table that breaks down a binary classifier's predictions into four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It shows exactly where the model succeeds and where it fails — something a single accuracy number cannot reveal. Every classification metric — precision, recall, F1, MCC, specificity, balanced accuracy — is derived directly from these four cells.
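If you have raw labels rather than the four counts, a tally like the following produces the values to enter. It is a minimal sketch that assumes labels are encoded as 1 (positive) and 0 (negative); the function name and data are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN from parallel lists of 0/1 labels."""
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    return {
        "TP": counts[(1, 1)],
        "FP": counts[(0, 1)],
        "FN": counts[(1, 0)],
        "TN": counts[(0, 0)],
    }


# Hypothetical labels.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
cells = confusion_counts(actual, predicted)
print(cells)
```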
If 99% of transactions are not fraud, a model that always predicts "not fraud" scores 99% accuracy while never catching a single fraudulent case. Accuracy rewards the majority class and hides complete failure on the minority. Whenever one class exceeds roughly 80% of the data, treat accuracy as suspect and switch to MCC, balanced accuracy, or F1 — metrics that account for the distribution of both classes.
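The fraud example can be reproduced with a few lines. The dataset size below is a hypothetical choice; only the 99% / 1% split comes from the text.

```python
# Hypothetical data: 10,000 transactions, 1% fraud (label 1).
actual = [1] * 100 + [0] * 9900
# A "model" that always predicts "not fraud" (label 0).
predicted = [0] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = caught / sum(actual)

print(f"accuracy={accuracy:.2%}  fraud cases caught={caught}  recall={recall:.2%}")
```

Accuracy comes out at 99% while recall on the fraud class is 0%.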
Precision and recall usually pull in opposite directions. Lowering the classification threshold makes the model predict positive more often: recall rises or stays the same (more true positives caught), while precision typically falls (more false positives). Raising the threshold generally does the opposite. There is no free lunch: the right balance depends on the relative cost of false positives vs. false negatives in your application.
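A small sketch of the tradeoff, evaluating precision and recall at three thresholds on made-up scores. The function name and data are hypothetical, and a sample is assumed to be predicted positive when its score is at or above the threshold.

```python
def precision_recall_at(actual, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(actual, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and scores.
actual = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.15]
for threshold in (0.8, 0.5, 0.2):
    p, r = precision_recall_at(actual, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this data, moving the threshold from 0.8 to 0.2 raises recall from 0.50 to 1.00 while precision drops from 1.00 to about 0.57.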
Use ROC-AUC when classes are reasonably balanced and you want a threshold-independent summary of discriminative ability. Use PR-AUC when the positive class is rare (under ~10%), false negatives are costly, or you are primarily concerned with performance on the minority class. On highly imbalanced data, a strong ROC score can mask poor PR performance — always check both when prevalence is low.
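The gap between the two summaries on rare positives can be seen on a synthetic ranking. This sketch uses average precision as one common step-wise estimate of PR-AUC, which is an assumption and may differ from how the calculator integrates the curve. The function names and data are hypothetical, and both functions assume all scores are distinct.

```python
def average_precision(actual, scores):
    """Mean of precision@k taken at the rank of each positive (no tied scores)."""
    ranked = sorted(zip(scores, actual), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def roc_auc(actual, scores):
    """Fraction of positive/negative pairs ranked correctly (no tied scores)."""
    negatives_seen, correct_pairs = 0, 0
    for _, y in sorted(zip(scores, actual)):  # ascending by score
        if y == 0:
            negatives_seen += 1
        else:
            correct_pairs += negatives_seen   # negatives scored below this positive
    n_pos = sum(actual)
    return correct_pairs / (n_pos * (len(actual) - n_pos))


# Hypothetical ranking: 1,000 samples, only 2 positives, placed 11th and 21st.
scores = [1 - i / 1000 for i in range(1000)]
actual = [1 if i in (10, 20) else 0 for i in range(1000)]
auc = roc_auc(actual, scores)
ap = average_precision(actual, scores)
print(f"ROC AUC = {auc:.3f}, average precision = {ap:.3f}")
```

Both positives rank in the top 3% of samples, so ROC AUC is above 0.98, yet a user reading the list from the top meets ten negatives before the first positive, and average precision stays below 0.1.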
MCC is often recommended as a single summary metric because it uses all four confusion matrix cells, is far less distorted by class imbalance than accuracy, and cannot be inflated by predicting the majority class. Balanced accuracy is a simpler alternative with similar properties. F1 remains widely used in NLP and retrieval tasks. No single metric tells the whole story, so always back up any summary metric with the full confusion matrix.
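Balanced accuracy is the mean of recall on the positive class and recall on the negative class (specificity). A minimal sketch follows, with a hypothetical function name and counts; treating an empty class as contributing 0.0 is an assumed convention.

```python
def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical dataset: 10 positives, 990 negatives.
always_negative = balanced_accuracy(tp=0, fp=0, fn=10, tn=990)
useful_model = balanced_accuracy(tp=8, fp=50, fn=2, tn=940)
print(f"always negative: {always_negative:.3f}, useful model: {useful_model:.3f}")
```

The always-negative predictor gets 0.5 despite 99% plain accuracy, while the second model, which has lower plain accuracy (94.8%), scores about 0.875.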
Most classifiers output a probability score between 0 and 1; the threshold is the cutoff above which a sample is classified as positive. The default of 0.5 is often not optimal, especially on imbalanced data. Lowering the threshold increases recall (catches more positives) and usually decreases precision (more false alarms); raising it does the opposite. The Threshold Tuner in this calculator requires uploaded prediction scores and shows the full tradeoff curve.
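One simple way to pick a threshold offline is to try every distinct score as a cutoff and keep the one with the best value of a chosen metric. The sketch below uses F1 as that metric, which is an assumption for illustration and not necessarily what the Threshold Tuner optimizes; the function name and data are hypothetical, and scores at or above the cutoff are treated as positive.

```python
def best_f1_threshold(actual, scores):
    """Return (threshold, F1) for the candidate cutoff with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(actual, scores) if s < threshold and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # equivalent form of the F1 formula
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical labels and scores.
actual = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.15]
threshold, f1 = best_f1_threshold(actual, scores)
print(f"best threshold = {threshold}, F1 = {f1:.2f}")
```

On this data the best cutoff is 0.45 (F1 = 0.80); a cutoff of 0.5 would classify the same samples as 0.55 does and give F1 of about 0.67. If false positives and false negatives have known costs, minimizing expected cost is a reasonable alternative criterion.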
True Positive (TP): model predicted positive, actual was positive — a correct detection. False Positive (FP): predicted positive, actual was negative — a false alarm (Type I error). False Negative (FN): predicted negative, actual was positive — a missed detection (Type II error). True Negative (TN): predicted negative, actual was negative — a correct rejection. In most real problems, false positives and false negatives carry very different costs, making the full matrix more useful than any single summary number.
A model can have high accuracy and a low F1 score at the same time; this is the classic imbalanced dataset trap. A model that predicts the majority class 100% of the time achieves high accuracy but scores F1 = 0 on the minority class, because it never makes a true positive prediction. Conversely, a model with lower accuracy can achieve a much higher F1 by correctly identifying minority-class cases at the cost of some majority-class errors. This is why accuracy and F1 should always be evaluated together alongside the full confusion matrix.
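A quick numeric comparison of two hypothetical models on the same imbalanced dataset (50 positives, 950 negatives) shows both directions of the effect. The function name and counts are made up for illustration, and F1 is reported as 0.0 when there are no true positives, false positives, or false negatives to divide by.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # equivalent to 2PR / (P + R)
    return accuracy, f1


# Model A always predicts the majority (negative) class.
acc_a, f1_a = accuracy_and_f1(tp=0, fp=0, fn=50, tn=950)
# Model B finds most positives but raises more false alarms.
acc_b, f1_b = accuracy_and_f1(tp=40, fp=100, fn=10, tn=850)

print(f"Model A: accuracy={acc_a:.2f}  F1={f1_a:.2f}")
print(f"Model B: accuracy={acc_b:.2f}  F1={f1_b:.2f}")
```

Model A reaches 95% accuracy with F1 = 0, while Model B drops to 89% accuracy but reaches F1 of about 0.42.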