Tensor Parallelism

Partitions individual tensor operations (specifically massive Matrix Multiplications) across multiple participating GPUs.

Source: mortalapps.com
TL;DR
  • Partitions individual tensor operations (specifically massive Matrix Multiplications) across multiple participating GPUs.
  • Mathematically slices the weight matrices of linear layers into column-parallel and row-parallel processing chunks.
  • Avoids the parameter discarding/fetching overhead of FSDP, but introduces severe, blocking communication collectives within both the forward and backward passes.
  • Restricted almost exclusively to intra-node environments (e.g., within an 8-GPU NVLink chassis) due to extreme latency sensitivity.

Why This Matters

FSDP and ZeRO-3 shard parameters across GPUs, but before each General Matrix Multiply (GEMM) they must gather the full weights of the layer being executed into a single GPU's memory. If one layer's weight matrix, or the activations it produces, exceeds a single GPU's HBM capacity, FSDP and ZeRO-3 run out of memory. Tensor Parallelism (TP) avoids this by distributing the computation itself across devices, which makes it a standard ingredient for training and serving the largest LLMs (e.g., Llama 3.1 405B).

Core Intuition

The intuition relies on the decomposable nature of matrix multiplication. To compute $Y = XW$, where $W$ is a massive weight matrix, we can slice $W$ vertically (Column Parallelism) into sub-matrices $W_1$ and $W_2$. GPU 1 computes $Y_1 = XW_1$ and GPU 2 computes $Y_2 = XW_2$. To assemble the final output, the GPUs simply concatenate the results: $Y = [Y_1, Y_2]$. If the subsequent neural network layer is sliced horizontally (Row Parallelism), each GPU multiplies its slice of the input by its slice of the weights and obtains a partial sum of the full output. A blocking AllReduce collective then sums these partial outputs, producing the same result as a single-device computation (up to floating-point rounding).
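The two decompositions can be checked numerically. The following is a minimal NumPy sketch, not a multi-GPU implementation: the two "GPUs" are simulated as list entries, the names `X`, `W`, `Y_col`, and `Y_row` are illustrative, and a plain Python `sum` stands in for the AllReduce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # activations: (batch, d_in)
W = rng.standard_normal((6, 8))   # weights: (d_in, d_out)
Y_ref = X @ W                     # single-device reference

# Column parallelism: split W along its output (column) axis.
# Every simulated GPU sees the full X and produces a slice of the output columns.
W_cols = np.split(W, 2, axis=1)                       # two (6, 4) shards
Y_col = np.concatenate([X @ W_k for W_k in W_cols], axis=1)

# Row parallelism: split W along its input (row) axis and X along the
# matching (column) axis. Each simulated GPU produces a full-shape partial sum.
W_rows = np.split(W, 2, axis=0)                       # two (3, 8) shards
X_cols = np.split(X, 2, axis=1)                       # two (4, 3) shards
partials = [X_k @ W_k for X_k, W_k in zip(X_cols, W_rows)]
Y_row = sum(partials)                                 # stands in for AllReduce(sum)

col_ok = np.allclose(Y_col, Y_ref)
row_ok = np.allclose(Y_row, Y_ref)
```

Both checks compare against the unsharded product with `np.allclose` rather than exact equality, because summing partial products in a different order changes floating-point rounding.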

Technical Deep Dive

Megatron-LM pioneered the modern MLP and Attention TP architecture.

MLP Block Decomposition:

The first Linear layer is configured as Column-Parallel. The input activation is replicated on every GPU in the TP group, and each GPU computes a fraction of the hidden-state dimension. The non-linear activation function (e.g., GeLU) is element-wise, so it is applied independently to these local slices with no communication. The second Linear layer is configured as Row-Parallel. It takes the sliced hidden states, multiplies them by the horizontally sliced weights, and requires an immediate AllReduce to sum the partial results across the TP group.
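The MLP decomposition can be sketched in a few lines of NumPy. This is a single-process simulation under stated assumptions: biases are omitted, GeLU uses the tanh approximation, the function names `mlp_reference` and `mlp_tensor_parallel` are hypothetical, and `np.sum` over the per-rank partials stands in for the AllReduce.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (element-wise)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_reference(x, w1, w2):
    # Unsharded MLP: Linear -> GeLU -> Linear (no biases in this sketch)
    return gelu(x @ w1) @ w2

def mlp_tensor_parallel(x, w1, w2, tp):
    w1_shards = np.split(w1, tp, axis=1)   # column-parallel: split output features
    w2_shards = np.split(w2, tp, axis=0)   # row-parallel: split input features
    partials = []
    for w1_k, w2_k in zip(w1_shards, w2_shards):   # one iteration per simulated GPU
        h_k = gelu(x @ w1_k)               # local hidden slice, no communication
        partials.append(h_k @ w2_k)        # full-shape partial output
    return np.sum(partials, axis=0)        # stands in for AllReduce(sum)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))            # (tokens, hidden)
w1 = rng.standard_normal((8, 32))          # hidden -> 4 * hidden
w2 = rng.standard_normal((32, 8))          # 4 * hidden -> hidden

y_ref = mlp_reference(x, w1, w2)
y_tp = mlp_tensor_parallel(x, w1, w2, tp=4)
mlp_ok = np.allclose(y_tp, y_ref)
```

The column-then-row ordering matters: because GeLU is non-linear, it cannot be applied to partial sums, so the first layer must be split in a way that gives each rank complete (not partial) hidden values for its slice.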

Attention Block Decomposition:

The Query, Key, and Value (Q, K, V) projection matrices are Column-Parallel, so the attention heads are divided evenly across the GPUs and each GPU computes attention for its own heads without communication. The final output projection matrix is Row-Parallel, terminating with an AllReduce. Because TP slices the activations inside these blocks along the hidden dimension, it reduces their memory to roughly $1/t$ of the unsharded size, where $t$ is the TP degree. Activations outside the sharded regions (e.g., LayerNorm and dropout) remain replicated, and partitioning them along the sequence dimension requires the addition of Sequence Parallelism.
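The head-wise split can be illustrated with a small NumPy simulation. This is a sketch under several assumptions: a single sequence with no batch dimension, no causal mask, no biases, heads stored contiguously along the projection columns, and a hypothetical helper `heads_attention` that runs attention over whichever heads its weight slices contain.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def heads_attention(x, wq, wk, wv, wo, n_heads):
    """Self-attention over the heads contained in the given weight slices."""
    s = x.shape[0]
    d_head = wq.shape[1] // n_heads
    # Project, then reshape to (heads, seq, d_head)
    q = (x @ wq).reshape(s, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(s, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(s, n_heads, d_head).transpose(1, 0, 2)
    probs = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    ctx = (probs @ v).transpose(1, 0, 2).reshape(s, n_heads * d_head)
    return ctx @ wo                          # output projection

rng = np.random.default_rng(0)
seq, hidden, n_heads, tp = 5, 16, 4, 2
x = rng.standard_normal((seq, hidden))
wq, wk, wv, wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

y_ref = heads_attention(x, wq, wk, wv, wo, n_heads)

partials = []
for r in range(tp):                          # one iteration per simulated GPU
    cols = slice(r * hidden // tp, (r + 1) * hidden // tp)
    # Column-parallel Q/K/V (this rank's heads), row-parallel output projection
    partials.append(
        heads_attention(x, wq[:, cols], wk[:, cols], wv[:, cols],
                        wo[cols, :], n_heads // tp)
    )
y_tp = np.sum(partials, axis=0)              # stands in for AllReduce(sum)
attn_ok = np.allclose(y_tp, y_ref)
```

The split works only when the number of heads is divisible by the TP degree, since each rank must own whole heads for the softmax to stay local.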

| TP Configuration | Layer Type | Input Requirement | Communication Outcome |
|---|---|---|---|
| Column Parallel | Linear Q/K/V, MLP 1 | Broadcast / Identity | Local subset of outputs |
| Row Parallel | Linear Proj, MLP 2 | Sliced activations | AllReduce (sums partials) |

Key Takeaways

  • TP splits individual large matrix multiplications across a group of GPUs.
  • It costs two blocking AllReduce operations per Transformer block in the forward pass (one after attention, one after the MLP) and two more in the backward pass.
  • In practice, scaling is confined to intra-node topologies (typically a TP degree of at most 8, matching one NVLink chassis) because these collectives are extremely sensitive to interconnect latency.
  • TP remains a core technique for reducing serving/inference latency and for training the largest foundation models.
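To get a feel for the communication cost, the per-GPU traffic can be estimated with simple arithmetic. The sketch below assumes a bandwidth-optimal ring AllReduce, in which each GPU sends 2(t-1)/t times the tensor size, and assumes four AllReduces per Transformer block per training step (two forward, two backward), each over a (batch, seq, hidden) activation tensor. The function names and the example dimensions are illustrative, and real collective libraries may choose other algorithms.

```python
def allreduce_bytes_per_gpu(batch, seq, hidden, tp, bytes_per_elem=2):
    """Bytes each GPU sends for one ring AllReduce of a (batch, seq, hidden) tensor."""
    payload = batch * seq * hidden * bytes_per_elem
    return 2 * (tp - 1) / tp * payload

def tp_bytes_per_block(batch, seq, hidden, tp, bytes_per_elem=2):
    """Per-GPU traffic for one Transformer block in one training step.

    Assumes 2 AllReduces in the forward pass and 2 in the backward pass.
    """
    return 4 * allreduce_bytes_per_gpu(batch, seq, hidden, tp, bytes_per_elem)

# Illustrative configuration: 1 sequence of 4096 tokens, hidden size 8192,
# bf16 activations (2 bytes per element), TP degree 8.
per_allreduce = allreduce_bytes_per_gpu(batch=1, seq=4096, hidden=8192, tp=8)
per_block = tp_bytes_per_block(batch=1, seq=4096, hidden=8192, tp=8)

print(f"per AllReduce: {per_allreduce / 2**20:.0f} MiB")   # 112 MiB
print(f"per block:     {per_block / 2**20:.0f} MiB")       # 448 MiB
```

Because this traffic sits on the critical path of every layer and cannot be overlapped with the matrix multiplications that depend on it, the estimate helps explain why TP groups are usually kept within the fastest available interconnect.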