Train a Linear Regression Model with Sparse Symbols

In previous tutorials, we introduced CSRNDArray and RowSparseNDArray, the basic data structures for manipulating sparse data. MXNet also provides the Sparse Symbol API, which enables symbolic expressions that handle sparse arrays. In this tutorial, we first focus on how to compose a symbolic graph with sparse operators, then train a linear regression model using sparse symbols with the Module API.


To complete this tutorial, we need MXNet, plus the Jupyter Notebook and Requests packages:

pip install jupyter requests


Variables are placeholders for arrays. We can use them to hold sparse arrays too.

Variable Storage Types

The stype attribute of a variable is used to indicate the storage type of the array. By default, the stype of a variable is “default” which indicates the default dense storage format. We can specify the stype of a variable as “csr” or “row_sparse” to hold sparse arrays.

import mxnet as mx
import numpy as np
import random

# set the seeds for repeatability (the value 42 is arbitrary)
random.seed(42)
np.random.seed(42)
mx.random.seed(42)

# Create a variable to hold an NDArray
a = mx.sym.Variable('a')
# Create a variable to hold a CSRNDArray
b = mx.sym.Variable('b', stype='csr')
# Create a variable to hold a RowSparseNDArray
c = mx.sym.Variable('c', stype='row_sparse')
(a, b, c)

(<Symbol a>, <Symbol b>, <Symbol c>)
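
To make the "csr" and "row_sparse" storage types more concrete, the following plain-Python sketch shows roughly what each layout keeps for a small matrix. The helpers to_csr and to_row_sparse are hypothetical names introduced only for illustration; they are not MXNet functions, and the real arrays hold the same pieces as NDArrays rather than Python lists.

```python
# Illustrative only: plain-Python views of the two sparse layouts.
# These helpers are hypothetical and are not part of MXNet.

def to_csr(dense):
    """Return (data, indices, indptr) for a 2-D list of numbers."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # non-zero values, row by row
                indices.append(col)     # column of each stored value
        indptr.append(len(data))        # where the next row starts in `data`
    return data, indices, indptr

def to_row_sparse(dense):
    """Return (rows, row_ids): only the rows with at least one non-zero."""
    rows, row_ids = [], []
    for i, row in enumerate(dense):
        if any(v != 0 for v in row):
            rows.append(list(row))
            row_ids.append(i)
    return rows, row_ids

dense = [[0, 2, 0],
         [0, 0, 0],
         [5, 0, 7]]

csr = to_csr(dense)
rsp = to_row_sparse(dense)
print(csr)  # ([2, 5, 7], [1, 0, 2], [0, 1, 1, 3])
print(rsp)  # ([[0, 2, 0], [5, 0, 7]], [0, 2])
```

CSR is a natural fit for feature matrices where most entries in every row are zero, while row_sparse suits weights or gradients where entire rows are zero.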

Bind with Sparse Arrays

The sparse symbols constructed above declare storage types of the arrays to hold. To evaluate them, we need to feed the free variables with sparse data.

You can instantiate an executor from a sparse symbol by using the simple_bind method, which allocates zeros to all free variables according to their storage types. The executor provides a forward method for evaluation and an outputs attribute to get all the results. Later, we will show the use of the backward method and other methods for computing gradients and updating parameters. A simple example first:

shape = (2,2)
# Instantiate an executor from sparse symbols
b_exec = b.simple_bind(ctx=mx.cpu(), b=shape)
c_exec = c.simple_bind(ctx=mx.cpu(), c=shape)
# Sparse arrays of zeros are bound to b and c
print(b_exec.outputs, c_exec.outputs)
([
<CSRNDArray 2x2 @cpu(0)>], [
<RowSparseNDArray 2x2 @cpu(0)>])

You can update the array held by the variable by accessing the executor’s arg_dict and assigning new values.

b_exec.arg_dict['b'][:] = mx.nd.ones(shape).tostype('csr')
# The array that `b` holds is updated to all ones
eval_b = b_exec.outputs[0]
{'eval_b': eval_b, 'eval_b.asnumpy()': eval_b.asnumpy()}
{'eval_b':
 <CSRNDArray 2x2 @cpu(0)>, 'eval_b.asnumpy()': array([[ 1.,  1.],
        [ 1.,  1.]], dtype=float32)}

Symbol Composition and Storage Type Inference

Basic Symbol Composition

The following example builds simple element-wise expressions (addition and negation) with different storage types. The sparse symbols are available in the mx.sym.sparse package.

# Element-wise addition of variables with "default" stype
d = mx.sym.elemwise_add(a, a)
# Element-wise negation of a variable with "csr" stype
e = mx.sym.sparse.negative(b)
# Element-wise addition of variables with "row_sparse" stype
f = mx.sym.sparse.elemwise_add(c, c)
{'d':d, 'e':e, 'f':f}
{'d': <Symbol elemwise_add0>,
 'e': <Symbol negative0>,
 'f': <Symbol elemwise_add1>}

Storage Type Inference

What will be the output storage types of sparse symbols? In MXNet, for any sparse symbol, the result storage types are inferred from the storage types of the inputs. You can read the Sparse Symbol API documentation to find the output storage type of each operator. In the example below we will try out the storage types introduced in the Row Sparse and Compressed Sparse Row tutorials: default (dense), csr, and row_sparse.

add_exec = mx.sym.Group([d, e, f]).simple_bind(ctx=mx.cpu(), a=shape, b=shape, c=shape)
dense_add = add_exec.outputs[0]
# `e` is negative(csr), so this output's storage type will be inferred as "csr" (the name csr_add is kept from the original)
csr_add = add_exec.outputs[1]
# The output storage type of elemwise_add(row_sparse, row_sparse) will be inferred as "row_sparse"
rsp_add = add_exec.outputs[2]
{'dense_add.stype': dense_add.stype, 'csr_add.stype':csr_add.stype, 'rsp_add.stype': rsp_add.stype}
{'csr_add.stype': 'csr',
 'dense_add.stype': 'default',
 'rsp_add.stype': 'row_sparse'}

Storage Type Fallback

For operators that don’t specialize in certain sparse arrays, you can still use them with sparse inputs with some performance penalty. In MXNet, dense operators require all inputs and outputs to be in the dense format. If sparse inputs are provided, MXNet will convert sparse inputs into dense ones temporarily so that the dense operator can be used. If sparse outputs are provided, MXNet will convert the dense outputs generated by the dense operator into the provided sparse format. Warning messages will be printed when such a storage fallback event happens.

# `log` operator doesn't support sparse inputs at all, but we can fall back on the dense implementation
csr_log = mx.sym.log(b)
# `elemwise_add` operator doesn't support adding csr with row_sparse, but we can fall back on the dense implementation
csr_rsp_add = mx.sym.elemwise_add(b, c)
fallback_exec = mx.sym.Group([csr_rsp_add, csr_log]).simple_bind(ctx=mx.cpu(), a=shape, b=shape, c=shape)
fallback_add = fallback_exec.outputs[0]
fallback_log = fallback_exec.outputs[1]
{'fallback_add': fallback_add, 'fallback_log': fallback_log}
{'fallback_add':
 [[ 0.  0.]
  [ 0.  0.]]
 <NDArray 2x2 @cpu(0)>, 'fallback_log':
 [[-inf -inf]
  [-inf -inf]]
 <NDArray 2x2 @cpu(0)>}
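
The inference and fallback behaviour described above can be pictured as a lookup with a dense default. The sketch below is a toy model only: the table STYPE_RULES and the function infer_output_stype are hypothetical, cover just the cases shown in this tutorial, and do not reproduce MXNet's actual dispatch logic.

```python
import warnings

# Hypothetical rule table, limited to the cases shown in this tutorial.
# It is NOT MXNet's real dispatch logic.
STYPE_RULES = {
    ('elemwise_add', ('default', 'default')): 'default',
    ('negative', ('csr',)): 'csr',
    ('elemwise_add', ('row_sparse', 'row_sparse')): 'row_sparse',
}

def infer_output_stype(op, input_stypes):
    """Return (output_stype, used_fallback) for an operator call."""
    key = (op, tuple(input_stypes))
    if key in STYPE_RULES:
        return STYPE_RULES[key], False
    # No specialised kernel in the table: treat inputs as dense, emit dense output.
    warnings.warn('storage fallback for %s%s' % (op, key[1]))
    return 'default', True

print(infer_output_stype('elemwise_add', ['row_sparse', 'row_sparse']))
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the toy fallback warnings here
    print(infer_output_stype('elemwise_add', ['csr', 'row_sparse']))
    print(infer_output_stype('log', ['csr']))
```

The first call finds a sparse rule and keeps the row_sparse type; the other two have no rule, so they report a dense ("default") output and flag the fallback.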

Inspecting Storage Types of the Symbol Graph

When the environment variable MXNET_INFER_STORAGE_TYPE_VERBOSE_LOGGING is set to 1, MXNet will log the storage type information of operators’ inputs and outputs in the computation graph. For example, we can inspect the storage types of a linear classification network with sparse operators. Uncomment the os.environ line in the code below and inspect your console:

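# Set logging level for executor
import mxnet as mx
import os
# Uncomment the next line to log storage types when the graph is bound
# os.environ['MXNET_INFER_STORAGE_TYPE_VERBOSE_LOGGING'] = "1"
# Data in csr format
data = mx.sym.var('data', stype='csr', shape=(32, 10000))
# Weight in row_sparse format
weight = mx.sym.var('weight', stype='row_sparse', shape=(10000, 2))
bias = mx.symbol.Variable("bias", shape=(2,))
dot = mx.symbol.sparse.dot(data, weight)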
pred = mx.symbol.broadcast_add(dot, bias)
y = mx.symbol.Variable("label")
output = mx.symbol.SoftmaxOutput(data=pred, label=y, name="output")
executor = output.simple_bind(ctx=mx.cpu())

Training with Module APIs

In the following section we’ll walk through how one can implement linear regression using sparse symbols and sparse optimizers.

The function you will explore is: y = x1 + 2x2 + … + 100x100, where (x1, x2, …, x100) are input features and y is the corresponding label.
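
As a quick worked instance of this target function, the sketch below evaluates y for one sparse sample in plain Python. The sample values are made up for illustration; only the rule "the weight of feature i is i" comes from the formula above.

```python
# A sparse sample stored as {feature index (1-based): value}; every other
# feature is zero. The values are invented for illustration.
sample = {3: 2.0, 50: 1.0, 100: 0.5}

# Target weights from the formula above: the weight of x_i is i.
y = sum(i * value for i, value in sample.items())
print(y)  # 3*2.0 + 50*1.0 + 100*0.5 = 106.0
```

Only the non-zero features contribute to the sum, which is why a sparse representation of the inputs loses nothing here.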

Preparing the Data

In MXNet, both mx.io.LibSVMIter and mx.io.NDArrayIter support loading sparse data in CSR format. In this example, we’ll use the NDArrayIter.

You may see some warnings from SciPy. You don’t need to worry about those for this example.

# Random training data
feature_dimension = 100
train_data = mx.test_utils.rand_ndarray((1000, feature_dimension), 'csr', 0.01)
target_weight = mx.nd.arange(1, feature_dimension + 1).reshape((feature_dimension, 1))
train_label = mx.nd.dot(train_data, target_weight)
batch_size = 1
train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size, last_batch_handle='discard', label_name='label')

Defining the Model

Below is an example of a linear regression model specifying the storage type of the variables.

initializer = mx.initializer.Normal(sigma=0.01)
X = mx.sym.Variable('data', stype='csr')
Y = mx.symbol.Variable('label')
weight = mx.symbol.Variable('weight', stype='row_sparse', shape=(feature_dimension, 1), init=initializer)
bias = mx.symbol.Variable('bias', shape=(1, ))
pred = mx.sym.broadcast_add(mx.sym.sparse.dot(X, weight), bias)
lro = mx.sym.LinearRegressionOutput(data=pred, label=Y, name="lro")

The above network uses the following symbols:

  1. Variable X: The placeholder for sparse data inputs. The csr stype indicates that the array to hold is in CSR format.

  2. Variable Y: The placeholder for dense labels.

  3. Variable weight: The placeholder for the weight to learn. The stype of weight is specified as row_sparse so that it is initialized as RowSparseNDArray, and the optimizer will perform sparse update rules on it. The init attribute specifies what initializer to use for this variable.

  4. Variable bias: The placeholder for the bias to learn.

  5. sparse.dot: The dot product operation of X and weight. The sparse implementation will be invoked to handle csr and row_sparse inputs.

  6. broadcast_add: The broadcasting add operation to apply bias.

  7. LinearRegressionOutput: The output layer which computes l2 loss against its input and the labels provided to it.
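
To illustrate why a sparse update rule suits this model, here is a minimal plain-Python sketch of a single SGD step for pred = x · w + b with a squared-error loss. It is an assumption-laden simplification: sgd_step is a hypothetical helper, momentum is omitted, and MXNet's real optimizer kernels are not reproduced. The point is only that the weight gradient is non-zero just in the rows matching the sample's non-zero features.

```python
# One plain SGD step for pred = x . w + b with squared-error loss, where x
# is a single sparse sample stored as {column index (0-based): value}.
# Hypothetical helper; momentum and MXNet's real update kernels are omitted.

def sgd_step(x, label, weight, bias, lr):
    pred = sum(v * weight[j] for j, v in x.items()) + bias
    err = pred - label                  # d(0.5 * err**2) / d(pred)
    touched = []
    for j, v in x.items():              # the gradient is zero for all other rows
        weight[j] -= lr * err * v       # updates `weight` in place
        touched.append(j)
    bias -= lr * err
    return bias, sorted(touched)

weight = [0.0] * 5
bias = 0.0
x = {1: 2.0, 4: 1.0}                    # only features 1 and 4 are non-zero
bias, touched = sgd_step(x, label=3.0, weight=weight, bias=bias, lr=0.1)
print(touched)  # [1, 4]
print(weight)   # rows 0, 2 and 3 are still exactly 0.0
```

With 100 features (or millions, in realistic sparse models) and a handful of non-zeros per sample, touching only those rows is what makes the update cheap.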

Training the model

Once we have defined the model structure, the next step is to create a module and initialize the parameters and optimizer.

# Create module
mod = mx.mod.Module(symbol=lro, data_names=['data'], label_names=['label'])
# Allocate memory by giving the input data and label shapes
mod.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)
# Initialize parameters by random numbers
mod.init_params(initializer=initializer)
# Use SGD as the optimizer, which performs sparse update on "row_sparse" weight
sgd = mx.optimizer.SGD(learning_rate=0.05, rescale_grad=1.0/batch_size, momentum=0.9)
mod.init_optimizer(optimizer=sgd)

Finally, we train the parameters of the model to fit the training data by using the forward, backward, and update methods in Module.

# Use mean square error as the metric
metric = mx.metric.create('MSE')
# Train 10 epochs
for epoch in range(10):
    train_iter.reset()                          # rewind the iterator for this epoch
    metric.reset()                              # start the metric from scratch
    for batch in train_iter:
        mod.forward(batch, is_train=True)       # compute predictions
        mod.update_metric(metric, batch.label)  # accumulate prediction error (MSE)
        mod.backward()                          # compute gradients
        mod.update()                            # update parameters
    print('Epoch %d, Metric = %s' % (epoch, metric.get()))
assert metric.get()[1] < 1, "Achieved MSE (%f) is larger than expected (1.0)" % metric.get()[1]

Epoch 9, Metric = ('mse', 0.35979430613957991)

Training the model with multiple machines or multiple devices

Distributed training with row_sparse weights and gradients is supported in MXNet, which significantly reduces communication cost for large models. To train a sparse model with multiple machines, you need to call prepare before calling forward or save_checkpoint. Please refer to the example in mxnet/example/sparse/linear_classification for more details.