Gluon - Neural network building blocks

The Gluon package is a high-level interface for MXNet, designed to be easy to use while keeping most of the flexibility of the low-level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively in Python and then deploy them with a symbolic graph in C++ and Scala.

# import dependencies
from __future__ import print_function
import numpy as np
import mxnet as mx
import mxnet.ndarray as F
import mxnet.gluon as gluon
from mxnet.gluon import nn
from mxnet import autograd

Neural networks (and other machine learning models) can be defined and trained with the gluon.nn and gluon.rnn packages. A typical training script has the following steps:

  • Define network
  • Initialize parameters
  • Loop over inputs
  • Forward input through network to get output
  • Compute loss with output and label
  • Backprop gradient
  • Update parameters with gradient descent.

Define Network

gluon.Block is the basic building block of models. You can define networks by composing and inheriting Block:

class Net(gluon.Block):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            # layers created in name_scope will inherit name space
            # from parent layer.
            self.conv1 = nn.Conv2D(6, kernel_size=5)
            self.pool1 = nn.MaxPool2D(pool_size=(2,2))
            self.conv2 = nn.Conv2D(16, kernel_size=5)
            self.pool2 = nn.MaxPool2D(pool_size=(2,2))
            self.fc1 = nn.Dense(120)
            self.fc2 = nn.Dense(84)
            self.fc3 = nn.Dense(10)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        # 0 means copy over size from corresponding dimension.
        # -1 means infer size from the rest of dimensions.
        x = x.reshape((0, -1))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Initialize Parameters

A network must be created and initialized before it can be used:

net = Net()
# Initialize on CPU. Replace with `mx.gpu(0)`, or `[mx.gpu(0), mx.gpu(1)]`,
# etc to use one or more GPUs.
net.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())

Note that because we didn't specify the input size to the layers in Net's constructor, the shapes of the parameters cannot be determined at this point. Actual initialization is deferred to the first forward pass, so accessing a parameter's data now will raise an exception.

You can trigger the actual initialization of the weights by running a forward pass:

data = mx.nd.random_normal(shape=(10, 1, 32, 32))  # dummy data
output = net(data)

Or you can specify input size when creating layers, i.e. nn.Dense(84, in_units=120) instead of nn.Dense(84).

Loss Functions

Loss functions take (output, label) pairs and compute a scalar loss for each sample in the mini-batch. The scalars measure how far each output is from the label.

There are many predefined loss functions in gluon.loss. Here we use SoftmaxCrossEntropyLoss for digit classification.

To compute loss and backprop for one iteration, we do:

softmax_ce = gluon.loss.SoftmaxCrossEntropyLoss()
label = mx.nd.arange(10)  # dummy label
with autograd.record():
    output = net(data)
    loss = softmax_ce(output, label)
loss.backward()
print('loss:', loss)
print('grad:', net.fc1.weight.grad())

Updating the Weights

Now that the gradient is computed, we just need to update the weights. This is usually done with an update rule like weight = weight - learning_rate * grad / batch_size. Note that we divide the gradient by batch_size because the gradient is aggregated over the entire batch. For example,

lr = 0.01
for p in net.collect_params().values():
    p.data()[:] -= lr / data.shape[0] * p.grad()

But sometimes you want fancier update rules like momentum or Adam, and since this is commonly needed functionality, gluon provides a Trainer class for it:

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

softmax_ce = gluon.loss.SoftmaxCrossEntropyLoss()
with autograd.record():
    output = net(data)
    loss = softmax_ce(output, label)
loss.backward()

# Do the update. Trainer needs to know the batch size of the data to
# normalize the gradient by 1/batch_size.
trainer.step(data.shape[0])