Parallelism strategies for neural network training
Any large-enough neural network training job (e.g., training an LLM or an image or video generation model) employs parallelism strategies. In fact, such large-scale training jobs often employ multiple strategies in conjunction. There are two reasons for using parallelism. First, the models are large enough that they don't fit in a single accelerator's memory and hence need to be split up, with each split being trained in parallel. Second, parallelism can speed up the training process. In this post, I will cover the different types of parallelism that are commonly used. Note that I use GPUs as the accelerator, but the parallelism strategies are equally applicable to any accelerator (GPU or TPU or any-fancy-PU).
Data Parallelism
Data parallelism speeds up training by training the model on different parts of the data in parallel. Consider this simple example. I have a model that fits on a single GPU. I want to train this model on a large corpus of data. I can have the GPU process each data item and update the model sequentially. But I can also use two GPUs and have each GPU go through half of the data, roughly cutting the training time in half. But now I end up with two distinct copies of the model. I can “merge” or “sync” the models by averaging out the weights learned by each of the copies of the model. In practice, I wouldn’t wait for each GPU to go through its entire share of the data before syncing. Instead, I will split up the dataset into small batches and sync at the end of each batch. This ensures that each GPU incorporates the learning of the other GPU at periodic intervals. (In practice, frameworks average the gradients before each weight update rather than averaging the weights afterwards; for plain SGD the two are equivalent when the copies start the step with the same weights.) This is the basic idea of data parallelism. Its simplicity also makes it easily composable with other kinds of parallelism. E.g., consider that the model did not fit on a single GPU and instead was split in a complex manner using tensor and/or pipeline parallelism (described below). I could still speed up training by standing up two replicas of the setup and having the two setups train on separate subsets of the data in parallel.
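As a rough sketch (using PyTorch's torch.distributed API; the model, optimizer, loss function, and per-GPU batch here are placeholders, and the process group is assumed to be initialized already, e.g., via torchrun), the per-step gradient sync might look like this:

```python
import torch
import torch.distributed as dist

def train_step(model, loss_fn, optimizer, batch, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # Average gradients across all data-parallel replicas so that every
    # copy of the model applies the same update this step.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()
```

In practice, one would use a library wrapper such as torch.nn.parallel.DistributedDataParallel, which performs this all-reduce automatically and overlaps it with the backward pass.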
Tensor Parallelism and Pipeline Parallelism
Tensor parallelism and pipeline parallelism are different types of model parallelism. The idea is to split a large model across multiple GPUs. Pipeline parallelism is more intuitive (to me at least). If the model has 16 layers and each GPU can only hold 2 layers, I can fit the entire model on a total of 8 GPUs. Each GPU processes its part of the model (2 layers) and passes the relevant information (e.g., activations, gradients) to the GPU with the neighbouring layers. One of the primary challenges with pipeline parallelism is keeping the GPUs occupied. For example, in the naive design, the first GPU (with the first 2 layers) stays idle while the other 7 GPUs perform their parts of a forward pass followed by a backward pass. To avoid such idleness, the training batch is split into smaller subsets called micro-batches and training proceeds at a micro-batch granularity. After processing the first micro-batch, the first GPU hands off to the second GPU and starts processing the second micro-batch. If there are enough micro-batches, the first GPU can stay busy processing them while the other GPUs run their forward and backward passes, until the backward pass of the first micro-batch comes back to the first GPU.
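Here is a toy, forward-pass-only sketch of a two-stage pipeline (a hypothetical model split across two GPUs; the stage names and sizes are illustrative). Real pipeline implementations such as GPipe or PipeDream also interleave the forward and backward passes of different micro-batches to keep all stages busy:

```python
import torch
import torch.nn as nn

# Two pipeline stages, each holding a slice of the model's layers.
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")

def pipelined_forward(batch, num_micro_batches=4):
    outputs = []
    for micro in batch.chunk(num_micro_batches):  # split the batch into micro-batches
        h = stage0(micro.to("cuda:0"))            # stage 0 runs its layers
        h = h.to("cuda:1")                        # hand the activations to stage 1
        outputs.append(stage1(h))                 # stage 1 runs the remaining layers
    return torch.cat(outputs)
```

Note that this naive loop still processes micro-batches one after another; the scheduling that actually overlaps work across stages is what pipeline-parallel frameworks provide.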
While pipeline parallelism shards the model across GPUs horizontally (i.e., by layers), tensor parallelism shards the model vertically. Consider the previous example of a model with 16 layers. Instead of putting 2 complete layers on each of the 8 GPUs, I can keep 1/8th of every layer on each GPU. Each GPU then performs the forward and backward pass for the entire model (i.e., all the layers), but only on its shard of each layer. The key challenge with tensor parallelism is that not all layers of a model are amenable to such vertical splitting. It only works for layers that perform matrix multiplications that can be computed independently on sub-blocks of the matrices. As a result, tensor parallelism typically splits only some of the layers of a model.
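For intuition, here is a forward-pass-only sketch of a column-parallel linear layer in the spirit of Megatron-LM (the class name and initialization are illustrative, and the process group is assumed to be initialized; a real implementation would use autograd-aware collectives so that gradients flow back through the all-gather):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a contiguous slice of the output columns of the weight."""

    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Only this rank's shard of the full (out_features x in_features) weight.
        self.weight_shard = torch.nn.Parameter(
            torch.randn(out_features // world_size, in_features) * 0.02
        )

    def forward(self, x):
        local_out = x @ self.weight_shard.t()   # partial output computed on this GPU
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out)      # collect every rank's partial output
        return torch.cat(shards, dim=-1)        # reassemble the full output everywhere
```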
Tensor parallelism, pipeline parallelism, and data parallelism are often used in conjunction; the combination of all three is sometimes called 3D parallelism.
Fully Sharded Data Parallel
FSDP takes a slightly different approach from the strategies above. It keeps the data parallelism part as is – each data-parallel replica operates on a different subset of the data. It also keeps a flavor of model sharding, but it does not dictate how to shard the model (horizontally as in pipeline parallelism and/or vertically as in tensor parallelism). The model is sharded arbitrarily, and each GPU "owns" (i.e., is primarily responsible for storing the parameters of) certain parts (or the entirety) of certain layers. The reason a specific sharding strategy is not required is that, unlike tensor and pipeline parallelism, the computation is not sharded. That is, instead of each GPU operating only on its shard of the model (e.g., some specific layers of the model or some sub-part of multiple layers), each GPU performs the computation for all layers (and each full layer). To do so, each GPU gathers the relevant layer's parameters from the GPU(s) that own them just before computing on that layer. After the computation on a layer, its parameters are discarded from GPU memory (unless the GPU owns that layer) and computation moves on to the next layer. The main advantage of FSDP is that it is unaffected by a layer's amenability to tensor parallelism, and it does not require the complex orchestration needed to keep the GPUs busy as in pipeline parallelism.
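With PyTorch's FSDP wrapper, the setup can be quite compact. A minimal sketch (MyModel and data_loader are placeholders, and torch.distributed is assumed to be initialized, e.g., via torchrun):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(MyModel().cuda())  # parameters, gradients, and optimizer state get sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # construct the optimizer after wrapping

for batch, targets in data_loader:  # each rank sees a different shard of the data
    loss = torch.nn.functional.cross_entropy(model(batch), targets)
    loss.backward()        # FSDP all-gathers each layer's parameters as needed,
    optimizer.step()       # then frees the non-owned shards after use
    optimizer.zero_grad()
```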
FSDP is also called ZeRO-3. The 3 refers to the three kinds of training state being sharded: optimizer state, gradients, and model weights. ZeRO-1 shards only the optimizer state, and ZeRO-2 shards the optimizer state and gradients. There are also ZeRO-Offload and ZeRO-Infinity, which, instead of keeping everything in GPU memory, offload parts of the training state to CPU memory and to NVMe-attached SSDs, respectively.
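As an illustration, with the DeepSpeed library these variants are selected through a configuration object. A sketch (key names follow DeepSpeed's documented config schema, but treat the exact values as illustrative rather than a drop-in config):

```python
# Enable ZeRO stage 3 with CPU offload of parameters and optimizer state.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 3,                              # shard optimizer state, gradients, and weights
        "offload_param": {"device": "cpu"},      # offload parameters to CPU memory
        "offload_optimizer": {"device": "cpu"},  # offload optimizer state to CPU memory
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, ...)
```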
References and links for further reading
This paper has a great explanation of data, tensor, and pipeline parallelism and how to combine them.
Tensor parallelism was introduced in the Megatron-LM paper
PipeDream is one of the canonical papers on pipeline parallelism (along with contemporary papers like GPipe):
ZeRO parallelism series:
Original paper covering ZeRO-1, ZeRO-2, and ZeRO-3
ZeRO-infinity paper
ZeRO-offload paper
Meta’s blog has a good illustration of FSDP
NCCL documentation has good illustrations of the different collective communication operations (particularly useful for understanding FSDP as described in the Meta blog above)