
Imputing Missing Values with Median

  • Median imputation replaces missing data points with the middle value of the available dataset, providing a robust estimate against outliers.
  • It is a univariate imputation technique that ignores relationships between different features, making it computationally efficient but potentially simplistic.
  • The primary advantage of using the median over the mean is its resistance to skewed distributions and extreme values in the data.
  • Practitioners should evaluate if data is Missing Completely at Random (MCAR) before applying simple imputation methods to avoid introducing bias.

Why It Matters

01
Healthcare industry

In the healthcare industry, electronic health records (EHR) often contain missing laboratory results because not every patient undergoes every test. Researchers might use median imputation to fill in missing blood pressure readings for a cohort study when the data is missing at random. This allows them to maintain a large sample size for statistical analysis without discarding patients who only have partial records.

02
Retail demand forecasting

In retail demand forecasting, companies like Walmart or Amazon deal with massive datasets where sales figures for specific products might be missing due to inventory system glitches. By imputing the median sales volume for a particular product category, analysts can create a baseline for demand planning. This prevents the model from assuming a zero-sale day (which would be incorrect) or an extreme outlier, ensuring the supply chain remains stable.

03
Financial services sector

In the financial services sector, credit scoring models often encounter missing data in "Years of Employment" or "Annual Income" fields. Because income distributions are notoriously skewed by high earners, financial institutions prefer median imputation over mean imputation to avoid overestimating the average applicant's financial health. This ensures that the credit scoring algorithm remains robust and does not unfairly bias the model toward higher-income brackets due to skewed imputation.

How it Works

The Intuition of Central Tendency

When we encounter missing values in a dataset, we face a dilemma: do we delete the rows containing missing data, or do we fill them in? Deleting rows often leads to a significant loss of information, especially if the missingness is widespread. Imputation offers a way to preserve the sample size. The median is a "central" value. Imagine you are looking at the salaries of employees in a small startup. If one person is a billionaire founder, the mean salary will be artificially inflated, suggesting the average employee is a millionaire. The median, however, will reflect the salary of the person right in the middle of the pack. By using the median to fill in missing values, we are essentially saying, "If we don't know this value, let's assume it is the most typical value found in the rest of the data."
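The startup-salary intuition is easy to verify numerically. A minimal sketch with hypothetical salary figures (the numbers are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical startup payroll: nine ordinary salaries plus a billionaire founder.
salaries = np.array([50_000, 55_000, 60_000, 62_000, 65_000,
                     70_000, 72_000, 80_000, 90_000, 1_000_000_000])

print(f"Mean:   {np.mean(salaries):>13,.0f}")   # inflated to ~100 million by the founder
print(f"Median: {np.median(salaries):>13,.0f}")  # 67,500 -- the person in the middle
```

Filling a missing salary with the mean would claim the "typical" employee earns roughly 100 million; filling with the median claims 67,500, which matches the rest of the pack.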


Why Median Over Mean?

In machine learning, the choice between mean and median imputation is almost always dictated by the distribution of the data. The mean is sensitive to outliers. If you have a feature representing "House Prices" and there is one mansion worth 100 times the average home, the mean will be skewed upward. If you use that mean to fill in missing values, you are effectively injecting that outlier's influence into every missing cell. The median, by definition, is the 50th percentile. It ignores the magnitude of the values at the extremes and focuses only on the rank order. This makes median imputation a safer default for features that are not perfectly normally distributed.
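The rank-order property described above can be demonstrated directly: the median is the 50th percentile, so making an extreme value even more extreme moves the mean but leaves the median untouched. A short sketch with made-up house prices:

```python
import numpy as np

# Hypothetical house prices; one mansion dwarfs the rest.
prices = np.array([250_000, 300_000, 320_000, 350_000, 400_000, 30_000_000])

# The median is exactly the 50th percentile: it depends on rank order, not magnitude.
assert np.median(prices) == np.percentile(prices, 50)

# Make the mansion ten times more expensive: the median does not move, the mean does.
prices_bigger = prices.copy()
prices_bigger[-1] = 300_000_000

print(np.median(prices) == np.median(prices_bigger))  # True
print(np.mean(prices) == np.mean(prices_bigger))      # False
```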


Limitations and Context

While median imputation is computationally inexpensive and easy to implement, it has significant drawbacks. First, it reduces the variance of the feature. By replacing multiple missing values with the exact same median, you artificially "clump" data points around the center, which can lead to an underestimation of the true standard deviation of the feature. Second, it ignores the correlation between features. If you are imputing a "Weight" column, the median ignores the fact that a taller person likely weighs more. Advanced techniques like K-Nearest Neighbors (KNN) or MICE (Multivariate Imputation by Chained Equations) account for these relationships, whereas median imputation treats each column as an isolated island. Therefore, median imputation is best suited for quick baselines or when the missingness is relatively low (typically less than 5-10% of the data).
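The variance-shrinkage drawback is easy to observe on synthetic data: replacing a large fraction of values with a single constant pulls the spread of the feature inward. A minimal sketch, assuming normally distributed data with 30% of values knocked out at random:

```python
import numpy as np

rng = np.random.default_rng(0)
complete = rng.normal(loc=100, scale=15, size=1_000)

# Remove ~30% of the values at random, then fill the holes with the observed median.
values = complete.copy()
mask = rng.random(values.size) < 0.3
values[mask] = np.nan

median = np.nanmedian(values)
filled = np.where(np.isnan(values), median, values)

print(f"True std:           {complete.std():.2f}")
print(f"Std after imputing: {filled.std():.2f}")  # smaller: imputed points clump at the median
```

The second figure is systematically smaller because roughly 300 data points now sit at exactly the same value, understating the feature's true dispersion.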

Common Pitfalls

  • "Imputation adds information to the dataset." Many beginners believe that filling in missing values creates new knowledge. In reality, imputation is a way to handle missingness; it does not add new information and can actually introduce bias if the data is not missing at random.
  • "Median imputation is always the best choice." Some learners assume median is superior to mean in all cases. If the data is perfectly normally distributed, the mean and median are identical, and the mean is theoretically more efficient; median imputation is only "better" when outliers are present.
  • "I should calculate the median on the whole dataset before splitting." This is a classic case of data leakage. You must calculate the median only on the training set and apply that value to the test set to ensure your model evaluation is unbiased and reflects real-world performance.
  • "Median imputation preserves the correlation between features." This is false; median imputation is a univariate process that treats each column independently. It destroys the covariance structure between the imputed feature and other features in the dataset.

Sample Code

Python
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample dataset with missing values (NaN)
data = np.array([[10, 20], [np.nan, 25], [12, 30], [10, np.nan], [15, 35]])

# Initialize the imputer with 'median' strategy
imputer = SimpleImputer(strategy='median')

# Fit the imputer on the data and transform it
# The median of column 0's observed values (10, 12, 10, 15) is 11.0
# The median of column 1 (20, 25, 30, 35) is 27.5
imputed_data = imputer.fit_transform(data)

print("Original Data:\n", data)
print("\nImputed Data:\n", imputed_data)

# Expected Output:
# Original Data:
# [[10. 20.]
#  [nan 25.]
#  [12. 30.]
#  [10. nan]
#  [15. 35.]]
#
# Imputed Data:
# [[10. 20. ]
#  [11. 25. ]
#  [12. 30. ]
#  [10. 27.5]
#  [15. 35. ]]
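For tabular workflows, the same operation is commonly written with pandas: `DataFrame.median()` skips NaN values by default, and `fillna` accepts a per-column Series of fill values. A minimal sketch using the same numbers as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, np.nan, 12, 10, 15],
                   'b': [20, 25, 30, np.nan, 35]})

# Fill each column with its own median: 11.0 for 'a', 27.5 for 'b'.
df_filled = df.fillna(df.median())
print(df_filled)
```

The scikit-learn `SimpleImputer` is preferable inside model pipelines (it remembers the training medians for later transforms), while `fillna` is convenient for one-off exploratory cleaning.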

Key Terms

Missing Data
Refers to the absence of values for certain observations in a dataset, which can occur due to sensor failure, user non-response, or data entry errors. If left unaddressed, missing data can lead to biased models or errors during the execution of machine learning algorithms.
Imputation
The process of replacing missing data with substituted values to maintain the integrity of the dataset for analysis. Imputation allows models to process complete matrices without discarding valuable rows of information.
Median
The middle value in a sorted list of numbers, which splits the dataset into two equal halves. Unlike the mean, the median is not heavily influenced by extreme values or outliers, making it a "robust" measure of central tendency.
Robustness
A property of a statistical estimator that indicates its performance remains stable even when the underlying data contains outliers or noise. Median imputation is considered robust because a single extreme value in the dataset does not shift the median significantly.
Univariate Imputation
A technique where missing values in a specific feature are filled using only the information present within that same feature. This contrasts with multivariate imputation, which uses correlations between multiple features to predict missing values.
Skewness
A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. When data is highly skewed, the mean is pulled toward the tail, whereas the median remains a more representative "typical" value.
Data Leakage
A common pitfall where information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. When imputing, one must calculate the median only from the training set and apply that same value to the validation and test sets.