Training with Multiple GPUs Using Model Parallelism
Training deep learning models can be resource intensive. Even with a powerful GPU, some models can take days or weeks to train. Large long short-term memory (LSTM) recurrent neural networks can be especially slow to train, with each layer, at each time step, requiring eight matrix multiplications. Fortunately, given cloud services like AWS, machine learning practitioners often have access to multiple machines and multiple GPUs. One key strength of MXNet is its ability to leverage powerful heterogeneous hardware environments to achieve significant speedups.
There are two primary ways that we can spread a workload across multiple devices. In a previous document, we addressed data parallelism, an approach in which samples within a batch are divided among the available devices. With data parallelism, each device stores a complete copy of the model. Here, we explore model parallelism, a different approach. Instead of splitting the batch among the devices, we partition the model itself. Most commonly, we achieve model parallelism by assigning the parameters (and computation) of different layers of the network to different devices.
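To make the contrast concrete, here is a minimal sketch in plain Python. It makes no MXNet calls. The helper names `split_batch` and `assign_layers`, the device labels, and the layer names are all hypothetical and chosen only for illustration. The first helper divides the samples of a batch among devices (data parallelism), while the second places contiguous groups of layers on devices (model parallelism).

```python
def _chunk_bounds(total, parts):
    """Yield (start, end) index pairs that split `total` items into `parts` contiguous chunks."""
    size, extra = divmod(total, parts)
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        yield start, end
        start = end


def split_batch(batch, devices):
    """Data parallelism: each device receives a slice of the batch.

    Every device is assumed to hold a complete copy of the model.
    """
    bounds = _chunk_bounds(len(batch), len(devices))
    return {dev: batch[s:e] for dev, (s, e) in zip(devices, bounds)}


def assign_layers(layers, devices):
    """Model parallelism: each device receives a contiguous group of layers.

    Every device sees the whole batch but stores only its own layers' parameters.
    """
    placement = {}
    bounds = _chunk_bounds(len(layers), len(devices))
    for dev, (s, e) in zip(devices, bounds):
        for layer in layers[s:e]:
            placement[layer] = dev
    return placement


# Hypothetical devices, samples, and layer names for illustration only.
devices = ["gpu0", "gpu1"]
batch = list(range(8))
layers = ["embed", "lstm1", "lstm2", "softmax"]

shards = split_batch(batch, devices)
placement = assign_layers(layers, devices)
print(shards)
print(placement)
```

In the first case the model is replicated and the data is divided; in the second the data is shared and the model is divided.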
In particular, we will focus on LSTM recurrent networks. LSTMs are powerful sequence models that have proven especially useful for natural language translation, speech recognition, and working with time series data. For a general high-level introduction to LSTMs, see the excellent tutorial by Christopher Olah.
Model Parallelism: Using Multiple GPUs As a Pipeline
Model parallelism in deep learning was first proposed for the extraordinarily large convolutional layer in GoogLeNet. From this implementation, we take the idea of placing each layer on a separate GPU. Using model parallelism in such a layer-wise fashion provides the benefit that no GPU has to maintain all of the model parameters in memory.
In the preceding figure, each LSTM layer is assigned to a different GPU. After GPU 1 finishes computing layer 1 for the first sentence, it passes its output to GPU 2. At the same time, GPU 1 fetches the next sentence and starts training. This differs significantly from data parallelism. Here, there is no contention to update the shared model at the end of each iteration, and most of the communication happens when passing intermediate results between GPUs.
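The pipelining described above can be illustrated with a small scheduling sketch in plain Python. It is a simplification under stated assumptions, not MXNet's scheduler: one layer per GPU, every layer takes exactly one time step, and only the forward pass is modeled. The function name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(num_layers, num_sentences):
    """Return, for each time step, the (gpu, sentence) pairs that are busy.

    Assumptions: one layer per GPU, each layer takes one time step,
    and only the forward pass is modeled. GPUs and sentences are 1-indexed.
    """
    steps = []
    total_steps = num_layers + num_sentences - 1
    for t in range(total_steps):
        busy = []
        for gpu in range(num_layers):
            # GPU k can start sentence s only after GPU k-1 has finished it,
            # so sentence s reaches GPU k at time step s + k (0-indexed).
            sentence = t - gpu
            if 0 <= sentence < num_sentences:
                busy.append((gpu + 1, sentence + 1))
        steps.append(busy)
    return steps


schedule = pipeline_schedule(num_layers=3, num_sentences=4)
for t, busy in enumerate(schedule):
    print("step", t, busy)
```

At step 1 in this sketch, GPU 2 works on the first sentence while GPU 1 has already moved on to the second, which is the overlap that makes the pipeline worthwhile.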
Implementing model parallelism requires knowledge of the training task. Here are some general heuristics that we find useful:
- To minimize communication time, place neighboring layers on the same GPUs.
- Be careful to balance the workload between GPUs.
- Remember that different kinds of layers have different computation-memory properties.
Let's take a quick look at the two pipelines in the preceding diagram. They both have eight layers with a decoder and an encoder layer. Based on our first heuristic, it's unwise to place each neighboring layer on a separate GPU. We also want to balance the workload across GPUs. Although the LSTM layers consume less memory than the decoder/encoder layers, they consume more computation time because each time step of the unrolled LSTM depends on the previous one. Thus, the partition on the left will be faster than the one on the right because the workload is more evenly distributed.
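One rough way to reason about balance is to estimate the compute cost of each pipeline stage and compare the slowest stages, since in steady state a pipeline runs no faster than its slowest stage. The sketch below is plain Python. The layer names, the per-layer costs, and both partitions are made-up assumptions for illustration; they are not measurements and are not the partitions shown in the diagram.

```python
def stage_costs(partition, cost):
    """Sum the per-layer costs for each stage (one stage per GPU)."""
    return [sum(cost[layer] for layer in stage) for stage in partition]


def bottleneck(partition, cost):
    """The slowest stage bounds the steady-state throughput of the pipeline."""
    return max(stage_costs(partition, cost))


# Hypothetical relative compute costs: LSTM layers are assumed to be
# twice as expensive as the encoder and decoder layers.
cost = {"encoder": 1.0, "decoder": 1.0}
cost.update({"lstm%d" % i: 2.0 for i in range(1, 9)})

# Two hypothetical ways to place the same layers on four GPUs.
balanced = [
    ["encoder", "lstm1", "lstm2"],
    ["lstm3", "lstm4"],
    ["lstm5", "lstm6"],
    ["lstm7", "lstm8", "decoder"],
]
unbalanced = [
    ["encoder"],
    ["lstm1", "lstm2", "lstm3", "lstm4"],
    ["lstm5", "lstm6", "lstm7", "lstm8"],
    ["decoder"],
]

print(stage_costs(balanced, cost), bottleneck(balanced, cost))
print(stage_costs(unbalanced, cost), bottleneck(unbalanced, cost))
```

Both placements cross a GPU boundary three times, so under these assumptions the communication cost is similar and the difference comes from how evenly the compute is spread. A real decision would also need measured layer timings and per-GPU memory limits.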