Fundamentals of Probability Theory
- Probability theory provides the mathematical framework for quantifying uncertainty in data-driven systems.
- Random variables map outcomes of stochastic processes to numerical values, forming the basis of statistical modeling.
- Conditional probability and Bayes' Theorem are the engines behind modern machine learning inference and classification.
- Distributions, such as Gaussian or Bernoulli, allow us to model the underlying generative processes of real-world phenomena.
- Mastering these fundamentals is essential for understanding loss functions, optimization, and uncertainty estimation in deep learning.
Why It Matters
In the financial sector, companies like JPMorgan Chase use probability theory to perform Value at Risk (VaR) assessments. By modeling the probability distribution of asset returns, they can estimate the maximum potential loss over a given time frame with a specific confidence level. This allows banks to manage capital reserves effectively and mitigate the risks associated with market volatility.
In healthcare, diagnostic algorithms—such as those developed by IBM Watson Health—utilize Bayesian networks to assist in clinical decision-making. By calculating the conditional probability of a disease given a set of symptoms and patient history, these systems help doctors narrow down potential diagnoses. This probabilistic approach accounts for the inherent uncertainty in medical tests and patient reporting, leading to more robust clinical outcomes.
In the domain of autonomous vehicles, companies like Waymo and Tesla use probabilistic sensor fusion. Sensors like LiDAR, radar, and cameras provide noisy data about the environment, and the vehicle must maintain a "belief state" about the position of surrounding objects. By using Kalman filters and Bayesian updates, the vehicle continuously calculates the probability distribution of where other cars and pedestrians are located, allowing for safe navigation in complex, unpredictable urban environments.
How It Works
The Intuition of Uncertainty
At its core, probability theory is the language of uncertainty. In machine learning, we rarely have perfect information. Whether we are predicting stock prices, classifying images, or generating text, we are essentially trying to model a process that contains inherent randomness. Probability allows us to turn this "noise" into a structured mathematical object. Think of a random variable as a container for an unknown value; before we measure it, it exists as a distribution of possibilities. Once we observe data, we update our belief about that variable. This is the fundamental cycle of learning: starting with a prior assumption, observing evidence, and arriving at a posterior conclusion.
Discrete vs. Continuous Domains
Distinguishing between discrete and continuous domains is the first step in selecting an appropriate model. Discrete probability deals with countable sets—like the number of clicks on an ad or the number of words in a sentence. We use probability mass functions (PMFs) here because we can assign a specific probability to each integer. Continuous probability, however, deals with measurements like time, temperature, or pixel intensity. Because there are infinitely many points in any interval, the probability of the variable being exactly a specific number is zero. Instead, we use probability density functions (PDFs) to describe the probability of the variable falling within a range. This distinction dictates whether we use summation or integration in our calculations, as the sketch below illustrates.
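To make the summation-versus-integration distinction concrete, here is a minimal sketch using scipy.stats; the binomial and normal parameters are illustrative assumptions, not taken from the text above.
import numpy as np
from scipy import stats

# Discrete: a Binomial(n=10, p=0.3) PMF assigns a probability to each count 0..10.
counts = np.arange(0, 11)
pmf_total = stats.binom.pmf(counts, n=10, p=0.3).sum()  # sums to 1
# P(3 <= X <= 5) is a plain sum of point probabilities.
p_3_to_5 = stats.binom.pmf(np.arange(3, 6), n=10, p=0.3).sum()

# Continuous: a standard normal PDF assigns density, not probability, to points.
# P(X = exactly 0) is zero; probabilities come from integrating, here via the CDF.
p_within_one_sigma = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)

print(f"PMF total:       {pmf_total:.4f}")            # ~1.0000
print(f"P(3 <= X <= 5):  {p_3_to_5:.4f}")
print(f"P(-1 <= Z <= 1): {p_within_one_sigma:.4f}")   # ~0.6827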
The Power of Conditional Logic
Conditional probability is arguably the most important concept for an ML practitioner. Most models are essentially conditional probability estimators. When you train a neural network to classify an image, you are training it to learn P(y | x), the probability of a label y given the input x. You are asking the model: "Given these specific pixel values, what is the probability that the object is a cat?" By conditioning on the input data, we reduce the uncertainty of the output. This logic extends to joint distributions, where we look at the relationship between multiple variables simultaneously. Understanding how variables interact—and whether they are independent or dependent—is the difference between a model that captures the structure of data and one that merely memorizes it.
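As a small illustration of how a conditional distribution is derived from a joint one, consider the sketch below; the two variables and their joint table are hypothetical, chosen only to keep the arithmetic readable.
import numpy as np

# Hypothetical joint distribution P(weather, activity) over two discrete variables.
# Rows: weather in {sunny, rainy}; columns: activity in {walk, stay_in}.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])

# Marginal P(weather): sum out the activity dimension.
p_weather = joint.sum(axis=1)

# Conditional P(activity | weather) = P(weather, activity) / P(weather).
p_activity_given_weather = joint / p_weather[:, None]

print(p_weather)                    # [0.5 0.5]
print(p_activity_given_weather[0])  # P(activity | sunny) = [0.8 0.2]
print(p_activity_given_weather[1])  # P(activity | rainy) = [0.1 0.9]
Dividing each row of the joint table by the corresponding marginal is exactly the conditioning operation described above: it renormalizes the slice of the joint distribution that is consistent with the observed variable.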
Bayesian Inference and Updating
Bayesian statistics provides a formal mechanism for updating our knowledge. We start with a "prior"—our initial belief about a parameter. We then collect data and calculate the "likelihood" of that data given our model. By combining the prior and the likelihood, we arrive at the "posterior," which is our updated belief. This is not just a theoretical exercise; it is the foundation of Bayesian Neural Networks and uncertainty quantification. In high-stakes fields like medicine or autonomous driving, knowing how confident a model is in its prediction is just as important as the prediction itself. Probability theory provides the rigorous framework to quantify that confidence.
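The prior-likelihood-posterior cycle can be shown in a few lines with a Beta-Bernoulli conjugate pair; the coin-flip counts and the Beta(2, 2) prior below are illustrative assumptions.
# Prior belief about a coin's heads probability: Beta(alpha=2, beta=2),
# a mild belief centered on 0.5.
alpha_prior, beta_prior = 2.0, 2.0

# Observed evidence: 10 flips, 7 heads and 3 tails.
heads, tails = 7, 3

# Conjugate update: the posterior is Beta(alpha + heads, beta + tails).
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

prior_mean = alpha_prior / (alpha_prior + beta_prior)
posterior_mean = alpha_post / (alpha_post + beta_post)
mle = heads / (heads + tails)  # likelihood-only estimate, ignoring the prior

print(f"Prior mean:      {prior_mean:.3f}")      # 0.500
print(f"MLE (data only): {mle:.3f}")             # 0.700
print(f"Posterior mean:  {posterior_mean:.3f}")  # 0.643, pulled toward the prior
Notice how the posterior mean sits between the prior belief and the raw data estimate; with more data, the likelihood dominates and the posterior converges toward the observed frequency.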
Common Pitfalls
- Confusing Probability with Odds: Learners often conflate probability (a ratio of favorable outcomes to total outcomes) with odds (a ratio of favorable outcomes to unfavorable outcomes). While they are related, they are not interchangeable; a probability of 0.5 corresponds to odds of 1:1, not 0.5.
- The Gambler's Fallacy: Many assume that if an event happens more frequently than normal during a given period, it will happen less frequently in the future, or vice versa. In reality, independent events (like coin flips) have no "memory," and the probability of the next outcome remains unchanged regardless of past results.
- Ignoring the Prior: In Bayesian inference, beginners often focus entirely on the likelihood of the data while ignoring the prior distribution. The prior is essential for regularization and preventing overfitting, especially when the amount of observed data is small.
- Misinterpreting the PDF: A common mistake is thinking the value of a PDF at a specific point is a probability. Because the area of a single point is zero, the PDF value can actually be greater than 1; it is a density, not a probability, and must be integrated over an interval to yield a meaningful probability (see the short check after this list).
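A quick check of the last pitfall, using scipy.stats (this snippet is a sketch separate from the sample code below): a narrow Gaussian has density values well above 1, yet any probability computed from it is still at most 1.
from scipy import stats

# A Normal with a small standard deviation concentrates its mass tightly,
# so the density at the mean exceeds 1 even though the total area is 1.
narrow = stats.norm(loc=0.0, scale=0.1)

density_at_mean = narrow.pdf(0.0)                      # ~3.989 -- a density, not a probability
prob_near_mean = narrow.cdf(0.05) - narrow.cdf(-0.05)  # ~0.383 -- a genuine probability

print(f"PDF at the mean:       {density_at_mean:.3f}")
print(f"P(-0.05 <= X <= 0.05): {prob_near_mean:.3f}")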
Sample Code
import numpy as np
# Simulating a Bernoulli process (e.g., a biased coin flip)
# Let p = 0.7 be the probability of 'Heads'
p = 0.7
n_trials = 10000
# Generate random samples using NumPy
samples = np.random.binomial(n=1, p=p, size=n_trials)
# Calculate the empirical probability (the frequentist approach)
empirical_p = np.mean(samples)
# Theoretical mean and variance of a Bernoulli(p) random variable
expected_value = p
variance = p * (1 - p)
print(f"Empirical Probability: {empirical_p:.4f}")
print(f"Theoretical Expected Value: {expected_value}")
print(f"Theoretical Variance: {variance:.4f}")
# Example output (the empirical value varies from run to run):
# Empirical Probability: 0.7012
# Theoretical Expected Value: 0.7
# Theoretical Variance: 0.2100