Gluon `Dataset`s and `DataLoader`

One of the most critical steps for model training and inference is loading the data: without data you can’t do Machine Learning! In this tutorial we use the Gluon API to define a Dataset and use a DataLoader to iterate through the dataset in mini-batches.

Introduction to `Dataset`s

Dataset objects represent collections of data, and include methods to load and parse the data (which is often stored on disk). Gluon has a number of different Dataset classes for working with image data out of the box, but we’ll use the ArrayDataset to introduce the idea of a Dataset.

We start by generating random data X (10 samples with 3 variables each) and corresponding random labels y to simulate a typical supervised learning task, and we pass both to the ArrayDataset.

import mxnet as mx
import os
import tarfile

mx.random.seed(42) # Fix the seed for reproducibility
X = mx.random.uniform(shape=(10, 3))
y = mx.random.uniform(shape=(10, 1))
dataset = mx.gluon.data.dataset.ArrayDataset(X, y)

A key feature of a Dataset is the ability to retrieve a single sample given an index. Our random data and labels were generated in memory, so this ArrayDataset doesn’t have to load anything from disk, but the interface is the same for all Datasets.

sample_idx = 4
sample = dataset[sample_idx]

assert len(sample) == 2
assert sample[0].shape == (3, )
assert sample[1].shape == (1, )
print(sample)

(
 [ 0.4375872   0.29753461  0.89177299]
 <NDArray 3 @cpu(0)>,
 [ 0.83261985]
 <NDArray 1 @cpu(0)>)

We get a tuple of a data sample and its corresponding label, which makes sense because we passed the data X and the labels y in that order when we instantiated the ArrayDataset. We don’t usually retrieve individual samples from Dataset objects though (unless we’re quality checking the output samples). Instead we use a DataLoader.
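The indexing contract described above is small enough to sketch in plain Python. The class below is a hypothetical stand-in (`PairDataset` is not part of Gluon) that mimics the behavior we just saw: it pairs data with labels, returns one (data, label) tuple per index, and reports its length. It uses Python lists rather than NDArrays so that it runs without MXNet.

```python
class PairDataset:
    """Hypothetical, minimal stand-in for an ArrayDataset-like object."""

    def __init__(self, data, labels):
        # Every sample needs exactly one label.
        if len(data) != len(labels):
            raise ValueError("data and labels must have the same length")
        self._data = data
        self._labels = labels

    def __getitem__(self, idx):
        # Return the sample and its label in the order they were passed in.
        return self._data[idx], self._labels[idx]

    def __len__(self):
        return len(self._data)


# Ten samples with three variables each, plus one label per sample.
features = [[float(i), float(i) + 0.5, float(i) + 1.0] for i in range(10)]
targets = [[float(i) * 10.0] for i in range(10)]
pair_dataset = PairDataset(features, targets)

sample_data, sample_label = pair_dataset[4]
print(len(pair_dataset), sample_data, sample_label)
```

Anything that supports indexing and `len()` in this way can be iterated sample by sample, which is all a batching layer needs from it.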

Introduction to `DataLoader`

A DataLoader is used to create mini-batches of samples from a Dataset, and provides a convenient iterator interface for looping over these batches. It’s typically much more efficient to pass a mini-batch of data through a neural network than a single sample at a time, because the computation can be performed in parallel. A required parameter of DataLoader is the size of the mini-batches you want to create, called batch_size.

Another benefit of using DataLoader is the ability to easily load data in parallel using multiprocessing. You can set the num_workers parameter to the number of CPUs available on your machine for maximum performance, or limit it to a lower number to spare resources.

from multiprocessing import cpu_count
CPU_COUNT = cpu_count()

data_loader = mx.gluon.data.DataLoader(dataset, batch_size=5, num_workers=CPU_COUNT)

for X_batch, y_batch in data_loader:
    print("X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape))

X_batch has shape (5, 3), and y_batch has shape (5, 1)
X_batch has shape (5, 3), and y_batch has shape (5, 1)

We can see 2 mini-batches of data (and labels), each with 5 samples, which makes sense given we started with a dataset of 10 samples. When comparing the shape of the batches to the samples returned by the Dataset, we’ve gained an extra dimension at the start which is sometimes called the batch axis.
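To see where the batch axis comes from, here is a plain-Python sketch of the stacking step. It assumes the default behavior of grouping samples along a new first axis. The helpers `shape_of` and `stack_samples` are hypothetical names used for illustration, and nested lists stand in for NDArrays.

```python
def shape_of(nested):
    """Return the shape of a regularly nested, non-empty list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def stack_samples(samples):
    """Group (data, label) samples into one data batch and one label batch."""
    data_batch = [data for data, _ in samples]
    label_batch = [label for _, label in samples]
    return data_batch, label_batch


# Five samples, each with data of shape (3,) and a label of shape (1,).
five_samples = [([0.1 * i, 0.2 * i, 0.3 * i], [float(i)]) for i in range(5)]
data_batch, label_batch = stack_samples(five_samples)

print(shape_of(five_samples[0][0]), "->", shape_of(data_batch))
print(shape_of(five_samples[0][1]), "->", shape_of(label_batch))
```

The per-sample shapes are unchanged. The only new dimension is the leading one, whose length equals the number of samples in the batch.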

Our data_loader loop will stop when every sample of dataset has been returned as part of a batch. Sometimes the dataset length isn’t divisible by the mini-batch size, leaving a final batch with a smaller number of samples. DataLoader’s default behavior is to return this smaller mini-batch (equivalent to last_batch='keep'), but this can be changed by setting the last_batch parameter to 'discard' (which drops the incomplete last batch) or 'rollover' (which carries the remaining samples over to the start of the next epoch).
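The three ways of handling an incomplete last batch can be sketched in plain Python. `SketchBatcher` is a hypothetical class written for this illustration. It mimics the semantics described above for sequential (unshuffled) indices and is not Gluon's own implementation.

```python
class SketchBatcher:
    """Hypothetical index batcher illustrating keep / discard / rollover."""

    def __init__(self, num_samples, batch_size, last_batch="keep"):
        if last_batch not in ("keep", "discard", "rollover"):
            raise ValueError("unknown last_batch value: {}".format(last_batch))
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.last_batch = last_batch
        self._leftover = []  # indices carried over between epochs (rollover only)

    def epoch(self):
        """Return the list of index batches for one pass over the data."""
        # Start from whatever the previous epoch left behind.
        batch, self._leftover = self._leftover, []
        batches = []
        for idx in range(self.num_samples):
            batch.append(idx)
            if len(batch) == self.batch_size:
                batches.append(batch)
                batch = []
        if batch:  # an incomplete batch remains
            if self.last_batch == "keep":
                batches.append(batch)
            elif self.last_batch == "rollover":
                self._leftover = batch
            # "discard": the incomplete batch is simply dropped
        return batches


batch_sizes = {}
for mode in ("keep", "discard", "rollover"):
    batcher = SketchBatcher(num_samples=10, batch_size=4, last_batch=mode)
    # Record the batch sizes seen in two consecutive epochs.
    batch_sizes[mode] = [[len(b) for b in batcher.epoch()] for _ in range(2)]
    print(mode, batch_sizes[mode])
```

With 10 samples and a batch size of 4, keep produces batches of 4, 4 and 2 in every epoch, and discard produces 4 and 4. Rollover produces 4 and 4 in the first epoch, then three full batches in the second, because that epoch starts with the two samples carried over.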

Machine learning with `Dataset`s and `DataLoader`s

You will often use a few different Dataset objects in your Machine Learning project. It’s essential to separate your training dataset from your testing dataset, and it’s also good practice to have a validation dataset (a.k.a. development dataset) that can be used for optimising hyperparameters.

Using Gluon Dataset objects, we define the data to be included in each of these separate datasets. Common use cases for loading data are covered already (e.g. mxnet.gluon.data.vision.datasets.ImageFolderDataset), but it’s simple to create your own custom Dataset classes for other types of data. You can even use included Dataset objects for common datasets if you want to experiment quickly; they download and parse the data for you! In this example we use the Fashion MNIST dataset from Zalando Research.

Many of the image Datasets accept a function (via the optional transform parameter) which is applied to each sample returned by the Dataset. It’s useful for performing data augmentation, but can also be used for simpler data type conversion and pixel value scaling, as seen below.

def transform(data, label):
    data = data.astype('float32')/255
    return data, label

train_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=True, transform=transform)
valid_dataset = mx.gluon.data.vision.datasets.FashionMNIST(train=False, transform=transform)

%matplotlib inline
from matplotlib.pylab import imshow

sample_idx = 234
sample = train_dataset[sample_idx]
data = sample[0]
label = sample[1]
label_desc = {0:'T-shirt/top', 1:'Trouser', 2:'Pullover', 3:'Dress', 4:'Coat', 5:'Sandal', 6:'Shirt', 7:'Sneaker', 8:'Bag', 9:'Ankle boot'}

imshow(data[:,:,0].asnumpy(), cmap='gray')
print("Data type: {}".format(data.dtype))
print("Label: {}".format(label))
print("Label description: {}".format(label_desc[label]))

Data type: <class 'numpy.float32'>

Label: 8

Label description: Bag


When training machine learning models it is important to shuffle the training samples every time you pass through the dataset (i.e. each epoch). Sometimes the order of your samples will have a spurious relationship with the target variable, and shuffling the samples helps remove this. With DataLoader it’s as simple as adding shuffle=True. You don’t need to shuffle the validation and testing data though.

If you have more complex shuffling requirements (e.g. when handling sequential data), take a look at mxnet.gluon.data.BatchSampler and pass an instance to your DataLoader through its batch_sampler parameter instead of setting batch_size and shuffle.

batch_size = 32
train_data_loader = mx.gluon.data.DataLoader(train_dataset, batch_size, shuffle=True, num_workers=CPU_COUNT)
valid_data_loader = mx.gluon.data.DataLoader(valid_dataset, batch_size, num_workers=CPU_COUNT)
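Conceptually, shuffling every epoch means visiting the sample indices in a fresh random order on each pass, while every sample is still seen exactly once. The sketch below illustrates that idea with the standard library only. `epoch_order` is a hypothetical helper, not the sampler Gluon uses internally.

```python
import random


def epoch_order(num_samples, shuffle, rng):
    """Return the order in which sample indices are visited in one epoch."""
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)  # a fresh permutation on every call
    return indices


rng = random.Random(0)  # seeded only to make the sketch repeatable
shuffled_epochs = [epoch_order(8, shuffle=True, rng=rng) for _ in range(3)]
ordered_epochs = [epoch_order(8, shuffle=False, rng=rng) for _ in range(3)]

for epoch, order in enumerate(shuffled_epochs):
    print("epoch", epoch, order)
```

Each shuffled epoch is a permutation of the same indices, so no sample is skipped or repeated. The unshuffled order is identical from epoch to epoch, which is fine for validation and testing.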

With both DataLoaders defined, we can now train a model to classify each image and evaluate the validation loss at each epoch. Our Fashion MNIST dataset has 10 classes, such as shirt, dress and sneaker. We define a simple fully connected network that outputs one score per class, and use softmax cross entropy as our loss (SoftmaxCrossEntropyLoss applies the softmax itself, so the final layer has no activation).

from mxnet import gluon, autograd, ndarray

def construct_net():
    net = gluon.nn.HybridSequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(128, activation="relu"))
        net.add(gluon.nn.Dense(64, activation="relu"))
        net.add(gluon.nn.Dense(10))
    return net

# construct and initialize network.
ctx =  mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

net = construct_net()
net.hybridize()
net.initialize(mx.init.Xavier(), ctx=ctx)
# define loss and trainer.
criterion = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
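For intuition about what the loss measures, here is a plain-Python sketch of softmax cross entropy for a single sample with an integer class label. It illustrates the formula only, under the assumption that the network outputs raw scores (logits). The function names are hypothetical and this is not Gluon's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


# An undecided network: identical scores for all 10 classes.
uniform_logits = [0.0] * 10
uniform_loss = softmax_cross_entropy(uniform_logits, label=8)

# A network that strongly favors the correct class (index 8).
confident_logits = [0.0] * 10
confident_logits[8] = 5.0
confident_loss = softmax_cross_entropy(confident_logits, label=8)

print(uniform_loss, confident_loss)
```

An undecided 10-class model scores ln(10), roughly 2.3, per sample, so a freshly initialized network can be expected to start in that region. The loss falls toward 0 as more probability is placed on the correct class.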

epochs = 5
for epoch in range(epochs):
    # training loop (with autograd and trainer steps, etc.)
    cumulative_train_loss = mx.nd.zeros(1, ctx=ctx)
    training_samples = 0
    for batch_idx, (data, label) in enumerate(train_data_loader):
        data = data.as_in_context(ctx).reshape((-1, 784)) # 28*28=784
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss = criterion(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_train_loss += loss.sum()
        training_samples += data.shape[0]
    train_loss = cumulative_train_loss.asscalar()/training_samples

    # validation loop
    cumulative_valid_loss = mx.nd.zeros(1, ctx=ctx)
    valid_samples = 0
    for batch_idx, (data, label) in enumerate(valid_data_loader):
        data = data.as_in_context(ctx).reshape((-1, 784)) # 28*28=784
        label = label.as_in_context(ctx)
        output = net(data)
        loss = criterion(output, label)
        cumulative_valid_loss += loss.sum()
        valid_samples += data.shape[0]
    valid_loss = cumulative_valid_loss.asscalar()/valid_samples

    print("Epoch {}, training loss: {:.2f}, validation loss: {:.2f}".format(epoch, train_loss, valid_loss))

Epoch 0, training loss: 0.54, validation loss: 0.45

...

Epoch 4, training loss: 0.32, validation loss: 0.33
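The loops above accumulate the summed per-sample losses and divide by the number of samples once per epoch, rather than averaging the per-batch means. The difference matters whenever the last batch is smaller, as it is here for the 10,000 validation images with a batch size of 32. The sketch below uses made-up loss values purely for illustration.

```python
# Per-sample losses for three hypothetical batches; the last one is smaller.
batch_losses = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [2.0],
]

# Approach used in the training loop: sum everything, divide once.
cumulative_loss = 0.0
num_samples = 0
for losses in batch_losses:
    cumulative_loss += sum(losses)
    num_samples += len(losses)
per_sample_mean = cumulative_loss / num_samples

# Alternative: average the per-batch means, which over-weights the small batch.
batch_means = [sum(losses) / len(losses) for losses in batch_losses]
mean_of_batch_means = sum(batch_means) / len(batch_means)

print(per_sample_mean, mean_of_batch_means)
```

In this example the single-sample batch counts as much as a full batch in the second calculation, which pulls the reported loss from about 0.67 up to 1.0. Dividing the cumulative loss by the sample count weights every sample equally.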
