Use data from S3 for training

AWS S3 is a cloud-based object storage service that allows storage and retrieval of large amounts of data at a very low cost. This makes it an attractive option to store large training datasets. MXNet is deeply integrated with S3 for this purpose.

An S3 protocol URL (like s3://bucket-name/training-data) can be provided as a parameter for any data iterator that takes a file path as input. For example,

data_iter = mx.io.ImageRecordIter(
    path_imgrec="s3://bucket-name/training-data/caltech_train.rec",
    data_shape=(3, 227, 227),
    batch_size=4,
    resize=256)

Following are detailed instructions on how to use data from S3 for training.

Step 1: Build MXNet with S3 integration enabled

Follow instructions here to install MXNet from source with the following additional steps to enable S3 integration.

  1. Install libcurl4-openssl-dev and libssl-dev before building MXNet. These packages are required to read/write from AWS S3.
  2. Append USE_S3=1 to config.mk before building MXNet. echo "USE_S3=1" >> config.mk

Step 2: Configure S3 authentication tokens

MXNet requires the S3 environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to be set. Here are instructions to get the access keys from AWS console.

export AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

Step 3: Upload data to S3

There are several ways to upload data to S3. One easy way is to use the AWS command line utility. For example, the following sync command will recursively copy contents from a local directory to a directory in S3.

aws s3 sync ./training-data s3://bucket-name/training-data

Step 4: Train with data from S3

Once the data is in S3, it is very straightforward to use it from MXNet. Any data iterator that can read/write data from a local drive can also read/write data from S3.

Let's modify an existing example code in MXNet repository to read data from S3 instead of local disk. mxnet/tests/python/train/test_conv.py trains a convolutional network using MNIST data from local disk. We'll do the following change to read the data from S3 instead.

~/mxnet$ sed -i -- 's/data\//s3:\/\/bucket-name\/training-data\//g' ./tests/python/train/test_conv.py

~/mxnet$ git diff ./tests/python/train/test_conv.py
diff --git a/tests/python/train/test_conv.py b/tests/python/train/test_conv.py
index 039790e..66a60ce 100644
--- a/tests/python/train/test_conv.py
+++ b/tests/python/train/test_conv.py
@@ -39,14 +39,14 @@ def get_iters():

     batch_size = 100
     train_dataiter = mx.io.MNISTIter(
-            image="data/train-images-idx3-ubyte",
-            label="data/train-labels-idx1-ubyte",
+            image="s3://bucket-name/training-data/train-images-idx3-ubyte",
+            label="s3://bucket-name/training-data/train-labels-idx1-ubyte",
             data_shape=(1, 28, 28),
             label_name='sm_label',
             batch_size=batch_size, shuffle=True, flat=False, silent=False, seed=10)
     val_dataiter = mx.io.MNISTIter(
-            image="data/t10k-images-idx3-ubyte",
-            label="data/t10k-labels-idx1-ubyte",
+            image="s3://bucket-name/training-data/t10k-images-idx3-ubyte",
+            label="s3://bucket-name/training-data/t10k-labels-idx1-ubyte",
             data_shape=(1, 28, 28),
             label_name='sm_label',
             batch_size=batch_size, shuffle=True, flat=False, silent=False)

After the above change test_conv.py will fetch data from S3 instead of the local disk.

python ./tests/python/train/test_conv.py
[21:59:19] src/io/s3_filesys.cc:878: No AWS Region set, using default region us-east-1
[21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
[21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
INFO:root:Start training with [cpu(0)]
Start training with [cpu(0)]
INFO:root:Epoch[0] Resetting Data Iterator
Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=11.277
Epoch[0] Time cost=11.277
INFO:root:Epoch[0] Validation-accuracy=0.955100
Epoch[0] Validation-accuracy=0.955100
INFO:root:Finish fit...
Finish fit...
INFO:root:Finish predict...
Finish predict...
INFO:root:final accuracy = 0.955100
final accuracy = 0.955100