Trainer¶
Training a neural network model consists of iteratively performing three simple steps.
The first step is the forward step, which computes the loss. In MXNet Gluon, this is achieved by running a forward pass with net.forward(X) (or simply net(X)) and then calling the loss function with the result of the forward pass and the labels, for example l = loss_fn(net(X), y).
The second step is the backward step, which computes the gradient of the loss with respect to the parameters. In Gluon, this is achieved by running the first step inside an autograd.record() scope to record the computations needed to calculate the loss, and then calling l.backward() to compute the gradients.
The final step is to update the neural network model parameters using an optimization algorithm. In Gluon, this step is performed by the gluon.Trainer and is the subject of this guide. When creating a Gluon Trainer you must provide a collection of parameters that need to be learnt. You also provide an Optimizer that will be used to update those parameters every training iteration when trainer.step is called.
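Putting the three steps together, a single training iteration typically looks like the following minimal sketch (the network, loss, data, and learning rate here are illustrative placeholders; the sections below walk through each piece in detail):

from mxnet import np, autograd, gluon

net = gluon.nn.Dense(1)                       # a toy single-output model
net.initialize()
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

X = np.random.uniform(size=(8, 4))            # a batch of 8 examples with 4 features
y = np.random.uniform(size=(8,))              # matching labels

with autograd.record():                       # step 1: forward pass, recording computations
    l = loss_fn(net(X), y)
l.backward()                                  # step 2: backward pass, computing gradients
trainer.step(batch_size=8)                    # step 3: parameter update via the optimizer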
Basic Usage¶
Network and Trainer¶
To illustrate how to use the Gluon Trainer, we will create a simple perceptron model and create a Trainer instance using the perceptron model parameters and a simple optimizer, sgd, with a learning rate of 1.
[1]:
from mxnet import np, autograd, optimizer, gluon
net = gluon.nn.Dense(1)
net.initialize()
trainer = gluon.Trainer(net.collect_params(),
                        optimizer='sgd', optimizer_params={'learning_rate': 1})
Forward and Backward Pass¶
Before we can use the trainer to update model parameters, we must first run the forward and backward passes. Here we implement a function to compute the first two steps (forward step and backward step) of training the perceptron on a random dataset.
[2]:
batch_size = 8
X = np.random.uniform(size=(batch_size, 4))
y = np.random.uniform(size=(batch_size,))
loss = gluon.loss.L2Loss()
def forward_backward():
    with autograd.record():
        l = loss(net(X), y)
    l.backward()
forward_backward()
Warning: It is extremely important that the gradients of the loss function with respect to your model parameters are computed before calling trainer.step. A common way to introduce bugs into your model training code is to omit the l.backward() call before the update step.
Before updating, let’s check the current network parameters.
[3]:
curr_weight = net.weight.data().copy()
print(curr_weight)
[[ 0.00427801 0.01983307 0.05400988 -0.03179503]]
Trainer step¶
Now we will call the step method to perform one update. We provide the batch_size as an argument to normalize the size of the gradients and make the update independent of the batch size; otherwise we'd get larger gradients with larger batch sizes. We can see the network parameters have now changed.
[4]:
trainer.step(batch_size)
print(net.weight.data())
[[0.14192605 0.27255496 0.27619216 0.12496824]]
Since we used plain SGD, the update rule is \(w = w - \frac{\eta}{b}\,\nabla \ell\), where \(b\) is the batch size, \(\nabla\ell\) is the gradient of the loss function with respect to the weights, and \(\eta\) is the learning rate.
We can verify this by running the following code snippet, which explicitly performs the SGD update.
[5]:
print(curr_weight - net.weight.grad() * 1 / batch_size)
[[0.14192605 0.27255496 0.27619216 0.12496824]]
Advanced Usage¶
Using Optimizer Instance¶
In the previous example, we used the string argument sgd to select the optimization method, and optimizer_params to specify the optimization method arguments.
All pre-defined optimization methods can be passed in this way and the complete list of implemented optimizers is provided in the mxnet.optimizer module.
However, we can also pass an optimizer instance directly to the Trainer constructor.
For example:
[6]:
optim = optimizer.Adam(learning_rate=1)
trainer = gluon.Trainer(net.collect_params(), optim)
[7]:
forward_backward()
trainer.step(batch_size)
net.weight.data()
[7]:
array([[-0.8580791 , -0.72745013, -0.7238127 , -0.87503594]])
For reference and implementation details about each optimizer, please refer to the guide and API doc for the optimizer module.
KVStore Options¶
The Trainer constructor also accepts the following keyword arguments:

kvstore – how the key-value store should be created for multi-GPU and distributed training. Check out mxnet.kvstore.KVStore for more information. String options are any of the following: ‘local’, ‘device’, ‘dist_device_sync’, ‘dist_device_async’.

compression_params – specifies the type of gradient compression and additional arguments depending on the type of compression being used. See mxnet.KVStore.set_gradient_compression_method for more details on gradient compression.

update_on_kvstore – whether to perform parameter updates on the KVStore. If None, the Trainer instance will choose the more suitable option depending on the type of KVStore.
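As a rough sketch of how these options are passed (the specific kvstore choice and flag value below are purely illustrative, not recommendations):

trainer = gluon.Trainer(net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.1},
                        kvstore='device',         # aggregate gradients on GPU(s)
                        update_on_kvstore=None)   # let the Trainer pick where to run the update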
Changing the Learning Rate¶
We set the initial learning rate when creating a trainer by passing it as an optimizer_param. However, sometimes we may need to change the learning rate during training, for example when doing an explicit learning rate warmup schedule. The trainer instance provides an easy way to achieve this.
The current learning rate can be accessed through the learning_rate attribute.
[8]:
trainer.learning_rate
[8]:
1
We can change it through the set_learning_rate method.
[9]:
trainer.set_learning_rate(0.1)
trainer.learning_rate
[9]:
0.1
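For instance, a simple manual warmup can be implemented by calling set_learning_rate before each update. This sketch reuses the forward_backward helper and batch_size from above; the warmup length and target rate are arbitrary illustrative values:

target_lr = 0.1
warmup_steps = 5
for i in range(warmup_steps):
    trainer.set_learning_rate(target_lr * (i + 1) / warmup_steps)  # ramp up linearly
    forward_backward()
    trainer.step(batch_size)
print(trainer.learning_rate)  # 0.1 once the warmup is complete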
In addition, there are multiple pre-defined learning rate scheduling methods already implemented in the mxnet.lr_scheduler module. A learning rate scheduler can be incorporated into your trainer by passing it in as an optimizer_params entry. Please refer to the LR scheduler guide to learn more.
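As an illustration (the step and factor values below are arbitrary), a scheduler such as mxnet.lr_scheduler.FactorScheduler can be handed to the optimizer through optimizer_params:

from mxnet import lr_scheduler

schedule = lr_scheduler.FactorScheduler(step=250, factor=0.5)  # halve the learning rate every 250 updates
trainer = gluon.Trainer(net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 1, 'lr_scheduler': schedule})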
Summary¶
The MXNet Gluon Trainer API is used to update the parameters of a network with a particular optimization algorithm.

After the forward and backward pass, the model update step is done in Gluon using trainer.step().

A Gluon Trainer can be instantiated by passing in the name of the optimizer to use and the optimizer_params for that optimizer, or alternatively by passing in an instance of mxnet.optimizer.Optimizer.

You can change the learning rate for a Gluon Trainer by calling its set_learning_rate method, but Gluon also provides a module for learning rate scheduling.
Next Steps¶
While optimization and optimizers play a significant role in deep learning model training, there are still other important components to model training. Here are a few suggestions about where to look next.
The Optimizer API and optimizer guide have information about all the different optimizers implemented in MXNet and their update steps. The Dive into Deep Learning book also has a chapter dedicated to optimization methods and explains various key optimizers in great detail.
Take a look at the guide to parameter initialization in MXNet to learn about what initialization schemes are already implemented, and how to implement your custom initialization schemes.
Also check out this guide on parameter management to learn how to manage model parameters in Gluon.
Make sure to take a look at the guide to scheduling learning rates to learn how to create learning rate schedules to make your training converge faster.
Finally take a look at the KVStore API to learn how parameter values are synchronized over multiple devices.