Tensor Parallelism
Partitions individual tensor operations (specifically massive Matrix Multiplications) across multiple participating GPUs.
Source: mortalapps.com
- Mathematically slices the weight matrices of linear layers into column-parallel and row-parallel processing chunks.
- Avoids FSDP's overhead of repeatedly gathering and discarding parameters, but introduces blocking communication collectives on the critical path of both the forward and backward passes.
- Restricted almost exclusively to intra-node environments (e.g., within an 8-GPU NVLink chassis) due to extreme latency sensitivity.
Why This Matters
FSDP and ZeRO-3 shard parameters across GPUs, but each GPU must still reconstruct a layer's full parameter tensor in its own memory to execute the General Matrix Multiply (GEMM) operation. If a single layer's weight matrix (or the activations it produces) is too large for one GPU's HBM, FSDP and ZeRO-3 run out of memory. Tensor Parallelism (TP) avoids this by distributing the computation itself across devices, which makes it a practical requirement for training and serving the largest LLMs (e.g., models at the scale of GPT-4 or Llama 3.1 405B).
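The memory argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the matrix shape, the 2-byte (bf16) parameter size, and the helper names `weight_bytes` and `per_gpu_weight_bytes` are all hypothetical choices, not figures from any particular model.

```python
GIB = 1024 ** 3

def weight_bytes(rows, cols, bytes_per_param):
    """Size in bytes of a dense rows x cols weight matrix."""
    return rows * cols * bytes_per_param

def per_gpu_weight_bytes(rows, cols, bytes_per_param, tp_degree):
    """Bytes each GPU holds when the matrix is column-sliced across tp_degree GPUs."""
    if cols % tp_degree != 0:
        raise ValueError("cols must be divisible by tp_degree")
    return weight_bytes(rows, cols // tp_degree, bytes_per_param)

# Hypothetical layer: 16384 x 65536 weights stored in bf16 (2 bytes each).
rows, cols, bytes_per_param = 16384, 65536, 2

# What one GPU must hold at GEMM time when the full tensor is gathered.
full = weight_bytes(rows, cols, bytes_per_param)
# What one GPU holds under TP with a (hypothetical) degree of 8.
sharded = per_gpu_weight_bytes(rows, cols, bytes_per_param, tp_degree=8)

print(f"gathered on one GPU: {full / GIB:.2f} GiB")
print(f"per GPU with TP=8:   {sharded / GIB:.2f} GiB")
```

The same division applies to the activations the layer produces along its sliced dimension, which is why TP helps when either the weights or their outputs are the bottleneck.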
Core Intuition
The intuition relies on the decomposable nature of matrix multiplication. To compute $Y = XA$, where $A$ is a massive weight matrix, we can slice $A$ vertically (Column Parallelism) into sub-matrices $A_1$ and $A_2$. GPU 1 computes $Y_1 = XA_1$ and GPU 2 computes $Y_2 = XA_2$. To assemble the final output, the GPUs simply concatenate the results: $$Y = [Y_1, Y_2]$$ If the subsequent neural network layer $B$ is sliced horizontally (Row Parallelism) into $B_1$ and $B_2$, each GPU computes a local partial sum $Y_i B_i$. A blocking AllReduce collective is then executed to sum these partial outputs, $Z = Y_1 B_1 + Y_2 B_2$, producing the exact same result as if computed on a single massive processor.
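The column-then-row decomposition can be checked numerically. The following is a minimal sketch that simulates two "devices" with plain Python lists; the helper names (`matmul`, `split_columns`, `split_rows`, `all_reduce_sum`) and the small integer matrices are hypothetical and chosen only for illustration. `all_reduce_sum` is a stand-in for a real AllReduce collective.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Slice a matrix vertically into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Slice a matrix horizontally into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Element-wise sum of per-device matrices (stand-in for AllReduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

X = [[1, 2], [3, 4]]                      # input, replicated on both devices
A = [[1, 0, 2, 1], [0, 1, 1, 2]]          # first weight (2x4), column-sliced
B = [[1, 0], [0, 1], [1, 1], [2, 0]]      # second weight (4x2), row-sliced

A_shards = split_columns(A, 2)
B_shards = split_rows(B, 2)

# Column-parallel step: each device produces its own slice of the output columns.
Y_shards = [matmul(X, A_i) for A_i in A_shards]
# Row-parallel step: each device multiplies its slice by the matching rows of B.
Z_partials = [matmul(Y_i, B_i) for Y_i, B_i in zip(Y_shards, B_shards)]
# The only communication: sum the partial outputs across devices.
Z_parallel = all_reduce_sum(Z_partials)

# Single-device reference for comparison.
Y_full = matmul(X, A)
Z_reference = matmul(Y_full, B)
Y_concat = [r1 + r2 for r1, r2 in zip(*Y_shards)]

print(Z_parallel == Z_reference)
```

Note that the concatenated intermediate `Y_concat` is built here only for the comparison. When a column-parallel layer feeds a row-parallel one, each device can pass its local slice straight through, so no gather is needed between the two.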
Technical Deep Dive
Megatron-LM pioneered the modern MLP and Attention TP architecture.
MLP Block Decomposition:
The first Linear layer is configured as Column-Parallel. The input activation is replicated on (broadcast to) all TP GPUs. Each GPU computes a fraction of the hidden dimension. Because the non-linear activation function (e.g., GeLU) is element-wise, it is applied independently to these local slices with no communication. The second Linear layer is configured as Row-Parallel. It takes the sliced hidden states, multiplies them by the horizontally sliced weights, and requires an immediate AllReduce to sum the partial results across the TP group.
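As a rough illustration of the MLP decomposition described above, the sketch below simulates the forward pass on several "devices" in a single process and compares it with an unsharded MLP. It is a simplified model, not Megatron-LM's code: biases are omitted, the function names (`tp_mlp`, `dense_mlp`, `max_abs_diff`) and the weights are hypothetical, and the final element-wise sum stands in for the AllReduce.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    """Exact (erf-based) GeLU applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def dense_mlp(x, w1, w2):
    """Unsharded reference: GeLU(x @ w1) @ w2."""
    return matmul(gelu(matmul(x, w1)), w2)

def tp_mlp(x, w1, w2, tp):
    """Simulate the TP forward pass of the same MLP on `tp` devices."""
    hidden = len(w1[0])
    if hidden % tp != 0:
        raise ValueError("hidden size must be divisible by tp")
    h = hidden // tp                               # hidden columns per device
    partials = []
    for rank in range(tp):
        lo, hi = rank * h, (rank + 1) * h
        w1_local = [row[lo:hi] for row in w1]      # column slice of the first weight
        w2_local = w2[lo:hi]                       # matching row slice of the second weight
        hidden_local = gelu(matmul(x, w1_local))   # element-wise, so no communication
        partials.append(matmul(hidden_local, w2_local))
    # Stand-in for the single forward AllReduce: element-wise sum over devices.
    return [[sum(v) for v in zip(*rows)] for rows in zip(*partials)]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
w1 = [[0.2, -0.4, 0.1, 0.7], [0.5, 0.3, -0.6, 0.0], [-0.1, 0.8, 0.4, -0.3]]
w2 = [[0.3, -0.2], [0.1, 0.6], [-0.5, 0.4], [0.7, 0.2]]

print(max_abs_diff(tp_mlp(x, w1, w2, tp=2), dense_mlp(x, w1, w2)))
```

The ordering matters: applying GeLU before the AllReduce is only valid because the first layer is column-sliced, so each device holds complete values for its hidden units. If the first layer were row-sliced, each device would hold partial sums, and GeLU of a partial sum is not a partial GeLU.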
Attention Block Decomposition:
The Query, Key, and Value (Q, K, V) projection matrices are Column-Parallel, meaning the attention heads are cleanly divided across the GPUs. The final output projection matrix is Row-Parallel, terminating with an AllReduce. Because TP slices the activations along the hidden dimension, it natively reduces the activation memory of the sliced tensors to $1/t$ of their full size, where $t$ is the TP degree, though full activation partitioning across the sequence requires the addition of Sequence Parallelism.
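The head-wise split of attention can be sketched in the same single-process style. This is a simplified illustration under several assumptions: no masking, no biases, one sequence, and randomly generated weights. The names `tp_attention`, `head_attention`, and `max_abs_diff` are hypothetical, and running with `tp=1` serves as the unsharded reference.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cols(m, lo, hi):
    """Column slice m[:, lo:hi]."""
    return [row[lo:hi] for row in m]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(q, k, v):
    """Scaled dot-product attention for one head (no mask)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(c) for c in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)

def tp_attention(x, wq, wk, wv, wo, n_heads, tp):
    """Simulate head-sharded attention on `tp` devices."""
    if n_heads % tp != 0:
        raise ValueError("n_heads must be divisible by tp")
    d = len(wq[0]) // n_heads                  # per-head dimension
    heads_per_rank = n_heads // tp
    partials = []
    for rank in range(tp):
        local = None
        for head in range(rank * heads_per_rank, (rank + 1) * heads_per_rank):
            lo, hi = head * d, (head + 1) * d  # column slice of Q/K/V for this head
            out = head_attention(matmul(x, cols(wq, lo, hi)),
                                 matmul(x, cols(wk, lo, hi)),
                                 matmul(x, cols(wv, lo, hi)))
            # Concatenate this device's head outputs along the feature axis.
            local = out if local is None else [a + b for a, b in zip(local, out)]
        lo, hi = rank * heads_per_rank * d, (rank + 1) * heads_per_rank * d
        partials.append(matmul(local, wo[lo:hi]))  # row slice of the output projection
    # Stand-in for the AllReduce that ends the attention block.
    return [[sum(v) for v in zip(*rows)] for rows in zip(*partials)]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

rng = random.Random(0)
def rand(rows, width):
    return [[rng.uniform(-1.0, 1.0) for _ in range(width)] for _ in range(rows)]

seq_len, hidden, n_heads = 3, 8, 4
x = rand(seq_len, hidden)
wq, wk, wv, wo = (rand(hidden, hidden) for _ in range(4))

reference = tp_attention(x, wq, wk, wv, wo, n_heads, tp=1)
sharded = tp_attention(x, wq, wk, wv, wo, n_heads, tp=2)
print(max_abs_diff(sharded, reference))
```

Each simulated device only ever materializes the Q, K, V, and attention outputs for its own heads, which is where the per-device activation saving along the hidden dimension comes from.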
| TP Configuration | Layer Type | Input Requirement | Communication Outcome |
| --- | --- | --- | --- |
| Column Parallel | Linear Q/K/V, MLP 1 | Broadcast / Identity | Local subset of outputs |
| Row Parallel | Linear Proj, MLP 2 | Sliced activations | AllReduce (sums partials) |