Statistics & Probability

Measures of Central Tendency

Central tendency metrics provide a single value that represents the "center" or "typical" point of a distribution.
The mean is sensitive to outliers, whereas the median provides a robust estimate for skewed data.
The mode identifies the most frequent observation, making it essential for categorical or multimodal datasets.
Selecting the appropriate measure depends on the data's scale, distribution shape, and the presence of extreme values.
In machine learning, these measures serve as the foundation for feature engineering, data imputation, and loss function design.

Why It Matters

Financial sector

In the financial sector, banks use the median income of a neighborhood to determine credit risk rather than the mean. If a neighborhood has one billionaire and ninety-nine low-income workers, the mean income would suggest an affluent area, leading to incorrect risk assessments. By using the median, the bank identifies the true economic status of the majority of the residents, ensuring more accurate lending models.
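
The arithmetic behind this scenario can be checked in a few lines. This is a minimal sketch; the income figures (30,000 for each worker and 1,000,000,000 for the billionaire) are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical annual incomes: 99 workers at 30,000 and one billionaire.
incomes = [30_000] * 99 + [1_000_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the single extreme value
median_income = statistics.median(incomes)  # stays at the income of a typical resident

print(f"Mean income:   {mean_income:,.0f}")
print(f"Median income: {median_income:,.0f}")
```

Under these assumed figures, the mean lands in the millions even though 99 of the 100 residents earn 30,000, while the median reports exactly that typical income.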

E-commerce

In e-commerce, companies like Amazon or Alibaba can use the mode to manage inventory and supply chain logistics. When analyzing the most frequently purchased sizes or colors of a product, the mean is not meaningful: colors have no numeric value, and an "average size" may not correspond to any item in stock. The median is of little help either, because it reports the middle of an ordering rather than the best seller. The mode tells the warehouse manager exactly which SKU is in highest demand, allowing for optimized stock levels and reduced storage costs.

Healthcare

In healthcare, clinical researchers often use the mean to report the average duration of a hospital stay for a specific procedure. However, they must also report the median to account for patients with rare complications who stay for weeks, which would otherwise inflate the average. Comparing the mean and median allows hospital administrators to identify if a few "long-stay" patients are skewing their operational efficiency metrics, helping them distinguish between standard care and exceptional cases.

How It Works

Understanding the Center

When we analyze data, we are often overwhelmed by the sheer volume of individual observations. If you have a dataset of 10,000 house prices, you cannot look at each price individually to understand the market. You need a summary. Measures of central tendency are the tools we use to condense this information into a single, representative number. Think of the "center" as the point around which the rest of the data points cluster. However, "center" is not a single concept; it depends on how you define "typical."

The Mean: The Balancing Point

The arithmetic mean is the most intuitive measure. Imagine you have a seesaw. If you place weights on the seesaw representing your data points, the mean is the exact point where the seesaw balances perfectly. Because it incorporates every single value in the dataset, it is mathematically elegant and useful for further statistical operations. However, this is also its greatest weakness. If you add a single, massive outlier—such as a billionaire moving into a neighborhood of modest homes—the mean will shift drastically, no longer representing the "typical" resident.
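
The seesaw picture corresponds to a simple property: deviations from the mean sum to zero. The sketch below illustrates this with NumPy; the price values are made up for the example and are not drawn from any real dataset.

```python
import numpy as np

# Illustrative house prices (in thousands); the values are assumptions for this sketch.
prices = np.array([200.0, 220.0, 240.0, 260.0, 280.0])
mean_price = prices.mean()

# Deviations left and right of the mean cancel out, like weights on a balanced seesaw.
deviations = prices - mean_price
print(f"Mean: {mean_price}, sum of deviations: {deviations.sum()}")

# A single extreme value drags the balance point away from the original cluster.
with_outlier = np.append(prices, 5000.0)
mean_with_outlier = with_outlier.mean()
print(f"Mean with outlier: {mean_with_outlier:.2f}")
```

After the outlier is added, the mean is larger than every one of the original five prices, so it no longer describes any of the homes in the original cluster.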

The Median: The Positional Middle

The median is the value that splits your data into two equal halves. If you line up all your data points from smallest to largest, the median is the one standing exactly in the middle; with an even number of points, it is the average of the two middle values. Because it depends on the rank of the data rather than the magnitude, it is highly resistant to extreme outliers: a few arbitrarily large or small values leave it essentially unchanged, and it only breaks down once about half of the data is contaminated. In machine learning, we often prefer the median when dealing with real-world data that is "dirty" or contains sensor errors, as it provides a more stable estimate of the central value.
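
The "line up and pick the middle" procedure can be sketched as follows. The helper name `median_by_sorting` and the sensor readings are hypothetical, introduced only for this illustration; in practice `np.median` or `statistics.median` does the same job.

```python
def median_by_sorting(values):
    """Return the median of a non-empty sequence of numbers (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits exactly in the middle.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up temperature readings, then the same readings with one sensor glitch.
readings = [21.0, 20.5, 21.5, 20.0, 22.0]
glitched = readings[:-1] + [9999.0]

median_clean = median_by_sorting(readings)
median_glitched = median_by_sorting(glitched)
print(median_clean, median_glitched)
```

Replacing the largest reading with an absurd value changes which number is at the end of the line, but not which number stands in the middle, so both medians agree.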

The Mode: The Most Popular Choice

The mode is the value that occurs most frequently. The mean requires numerical data and the median requires data that can at least be ordered, so the mode is the only measure of central tendency that works for nominal categorical data. For example, if you are analyzing the "color" of cars sold, you cannot calculate a mean or median, but you can certainly identify the most popular color. In multimodal distributions, where data clusters around two or more distinct peaks, the modes help us identify these separate groups, whereas the mean can land in the sparse region between them.
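
Both uses of the mode can be sketched with the Python standard library alone. The car colors and the bimodal sample below are invented for illustration; `statistics.multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Categorical data: a mean or median of colors is undefined, but counting still works.
car_colors = ["white", "black", "white", "silver", "black", "white", "red"]
top_color, top_count = Counter(car_colors).most_common(1)[0]
print(f"Most popular color: {top_color} ({top_count} sales)")

# Bimodal data: two clusters, one around 2 and one around 9.
bimodal = [1, 2, 2, 2, 3, 8, 9, 9, 9, 10]
modes = multimode(bimodal)                 # every value tied for the highest frequency
mean_bimodal = sum(bimodal) / len(bimodal)
print(f"Modes: {modes}, mean: {mean_bimodal}")
```

In this sample the mean falls between the two clusters at a value that never occurs in the data, while the two modes point at the clusters themselves.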

Choosing the Right Measure

Selecting the correct measure is a critical step in the data preprocessing pipeline. If your data follows a normal distribution (the classic bell curve), the mean, median, and mode will all be roughly equal. In this scenario, the mean is preferred because it uses all available information. However, if your data is skewed—such as income distribution, where a few high earners pull the mean to the right—the median is a much better representation of the "typical" person. Failure to choose the correct measure can lead to biased models, as the model will learn from a "center" that does not actually represent the majority of the data.
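
One rough way to see this difference is to compare the gap between mean and median on a symmetric sample and on a right-skewed one. The sketch below uses synthetic data; the distribution parameters, the seed, and the helper name `summarize` are arbitrary choices for illustration, with the log-normal sample standing in for income-like data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility

# Synthetic samples: one symmetric (normal), one right-skewed (log-normal).
symmetric = rng.normal(loc=50.0, scale=10.0, size=10_000)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)


def summarize(name, sample):
    """Print and return the mean and median of a sample (illustrative helper)."""
    mean, median = float(np.mean(sample)), float(np.median(sample))
    print(f"{name}: mean={mean:.3f}, median={median:.3f}, gap={mean - median:.3f}")
    return mean, median


sym_mean, sym_median = summarize("symmetric", symmetric)
skew_mean, skew_median = summarize("right-skewed", skewed)
```

A gap that is small relative to the spread of the data suggests the mean is a reasonable summary; a mean sitting well above the median is a hint of right skew, in which case the median is usually the safer description of the typical value.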

Common Pitfalls

"The mean is always the best measure of center." Learners often assume the mean is the default choice, but it is most representative when the data is roughly symmetric and free of extreme outliers. In skewed distributions, the mean provides a misleading representation of the "typical" case.
"The median is only for small datasets." Some believe the median is computationally expensive to calculate compared to the mean. Sorting takes $O(n \log n)$ time, and selection algorithms such as quickselect find the median in $O(n)$ time on average without a full sort, so the cost is trivial for most datasets and the robustness gained is worth it.
"A dataset must have a single mode." Many students are confused by multimodal data, assuming there is a mistake if two values appear with the same high frequency. In reality, multimodal data is common and often indicates the presence of distinct subgroups within the population.
"Outliers should always be removed." Learners often think that because outliers distort the mean, they should be deleted. However, outliers are often the most interesting data points, and using the median allows you to keep them in the analysis without letting them corrupt your summary statistics.

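On the concern that the median is expensive to compute: a full sort is not strictly required. The sketch below uses `np.partition`, which places the requested positions where they would be in sorted order without sorting the rest of the array. The helper name `median_by_selection` is hypothetical, and `np.median` remains the simpler choice in practice.

```python
import numpy as np


def median_by_selection(values):
    """Median via partial ordering instead of a full sort (illustrative helper)."""
    arr = np.asarray(values, dtype=float)
    n = arr.size
    if n == 0:
        raise ValueError("median of an empty array is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Only the middle position needs to hold its sorted value.
        return float(np.partition(arr, mid)[mid])
    # Even count: place both middle positions, then average them.
    part = np.partition(arr, [mid - 1, mid])
    return float((part[mid - 1] + part[mid]) / 2)


rng = np.random.default_rng(seed=0)  # arbitrary seed
sample = rng.normal(size=100_001)
selected = median_by_selection(sample)
print(selected, np.median(sample))
```

The two printed values should agree, since both approaches locate the same middle element.
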
Sample Code

Python

import numpy as np
from scipy import stats

# Generate a synthetic dataset with an outlier
data = np.array([10, 12, 12, 13, 12, 11, 14, 100])

# Calculate Mean
mean_val = np.mean(data)

# Calculate Median
median_val = np.median(data)

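# Calculate Mode
# keepdims=True (available in SciPy 1.9 and later) keeps the result as an array, so [0] extracts the scalar
mode_result = stats.mode(data, keepdims=True)
mode_val = mode_result.mode[0]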

print(f"Dataset: {data}")
print(f"Mean: {mean_val:.2f}")   # Output: 23.00 (Skewed by 100)
print(f"Median: {median_val:.2f}") # Output: 12.00 (Robust)
print(f"Mode: {mode_val}")       # Output: 12 (Most frequent)

# ML Application: Imputing missing values
# In practice, we use the median to fill NaNs to avoid outlier bias
data_with_nan = np.array([10, 12, np.nan, 13, 12])
imputed_data = np.nan_to_num(data_with_nan, nan=np.nanmedian(data_with_nan))
print(f"Imputed Data: {imputed_data}")

Key Terms

Arithmetic Mean

The sum of all values in a dataset divided by the total number of observations. It is the most common measure of center but is highly sensitive to extreme values or outliers.

Median

The middle value in a dataset when the observations are sorted in ascending or descending order; with an even number of observations, it is the average of the two middle values. It is considered a "robust" statistic because it is largely unaffected by the magnitude of extreme outliers.

Mode

The value that appears most frequently in a dataset. A dataset can have one mode, multiple modes (multimodal), or no mode if all values appear with equal frequency.

Outlier

A data point that differs significantly from other observations in a sample. In statistical analysis, outliers can disproportionately pull the mean toward them, often leading to misleading interpretations of the "center."

Skewness

A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Positive skew indicates a tail on the right side, while negative skew indicates a tail on the left.

Robust Statistics

Statistical methods that are designed to be resistant to errors or outliers in the data. These methods, such as using the median instead of the mean, ensure that the analysis remains valid even when the data is "noisy."

Central Tendency

A statistical measure that attempts to identify the single value that is most representative of an entire distribution. It acts as a summary statistic that describes the location of the bulk of the data.