Using MXNet with Large Tensor Support
What is large tensor support?
When creating a network that uses large amounts of data, as in a deep graph problem, you may need large tensor support. This is a relatively new feature: by default, tensors in MXNet are indexed with INT32, while MXNet built with large tensor support uses INT64 indices.
It is MXNet built with the additional flag USE_INT64_TENSOR_SIZE=1; in CMake builds the equivalent setting is USE_INT64_TENSOR_SIZE="ON".
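You can check whether your installed MXNet binary was built with this flag; a minimal sketch using the runtime feature API (assuming MXNet 1.5 or later):
from mxnet.runtime import Features
# INT64_TENSOR_SIZE is reported as enabled only on a large-tensor-enabled build
print(Features().is_enabled("INT64_TENSOR_SIZE"))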
When do you need it?
- When you are creating NDArrays with more than 2^31 elements.
- When the inputs to your model are tensors larger than 2^31 elements (loaded all at once in your code), or when operator attributes exceed 2^31 (see the sketch below).
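A quick way to tell whether a planned shape crosses that limit; the shape below is only an illustrative assumption:
import numpy as np
INT32_LIMIT = 2 ** 31                      # 2,147,483,648 elements
shape = (2, 2150000000)                    # hypothetical input shape
print(int(np.prod(shape, dtype=np.int64)) >= INT32_LIMIT)  # True -> needs a large tensor build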
How to identify that you need to use large tensors?
When you see one of the following errors:
- OverflowError: unsigned int is greater than maximum
- Check failed: inp->shape().Size() < 1 << 31 (4300000000 vs. 0) : Size of tensor you are trying to allocate is larger than 2^32 elements. Please build with flag USE_INT64_TENSOR_SIZE=1
- Invalid Parameter format for end expect int or None but value='2150000000', in operator slice_axis(name="", end="2150000000", begin="0", axis="0"). Here the input attribute was expected to be an int32 value (less than 2^31); because the received value is larger than that, the operator's parameter inference treats it as a string, which becomes an unexpected input. A working call on a large-tensor build is sketched below.
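For contrast, a minimal sketch of the same kind of call succeeding on a build with large tensor support (the sizes are only illustrative):
import mxnet as mx
a = mx.nd.arange(0, 4300000000, dtype="int64")
# attribute values beyond 2^31 are accepted when MXNet is built with USE_INT64_TENSOR_SIZE=1
b = mx.nd.slice_axis(a, axis=0, begin=0, end=2150000000)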
How to use it?
You can create a large NDArray that requires a large tensor enabled build to run as follows:
import mxnet as mx
from mxnet import nd
import numpy as np

LARGE_X = 4300000000
a = mx.nd.arange(0, LARGE_X, dtype="int64")
or
a = nd.ones(shape=LARGE_X)
or
a = nd.empty(LARGE_X)
or
a = nd.random.exponential(shape=LARGE_X)
or
a = nd.random.gamma(shape=LARGE_X)
or
a = nd.random.normal(shape=LARGE_X)
Caveats
- Use int64 as dtype whenever attempting to slice an NDArray when the range is over the maximum int32 value (a minimal sketch follows this list).
- Use int64 as dtype when passing indices as parameters or expecting output as parameters to and from operators.
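A minimal slicing sketch under the first caveat, assuming the LARGE_X defined above (the exact indices are illustrative):
a = nd.arange(0, LARGE_X, dtype="int64")
# slicing with begin/end values beyond the int32 limit requires a large tensor enabled build
b = a[LARGE_X - 10:LARGE_X]
print(b[-1])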
The following are the cases for large tensor usage where you must specify dtype as int64:
- randint():
low_large_value = 2 ** 32
high_large_value = 2 ** 34
# dtype is explicitly specified since the default dtype for randint is int32
a = nd.random.randint(low_large_value, high_large_value, dtype=np.int64)
- ravel_multi_index() and unravel_index():
# rand_coord_2d and SMALL_Y below are a helper and constant as used in MXNet's large tensor tests;
# any function that returns random 2D coordinates within the given bounds will do
SMALL_Y = 50  # illustrative small second dimension
x1, y1 = rand_coord_2d((LARGE_X - 100), LARGE_X, 10, SMALL_Y)
x2, y2 = rand_coord_2d((LARGE_X - 200), LARGE_X, 9, SMALL_Y)
x3, y3 = rand_coord_2d((LARGE_X - 300), LARGE_X, 8, SMALL_Y)
indices_2d = [[x1, x2, x3], [y1, y2, y3]]
# dtype is explicitly specified for indices, else they will default to float32
idx = mx.nd.ravel_multi_index(mx.nd.array(indices_2d, dtype=np.int64),
                              shape=(LARGE_X, SMALL_Y))
idx_numpy = idx.asnumpy()
indices_2d = mx.nd.unravel_index(mx.nd.array(idx_numpy, dtype=np.int64),
                                 shape=(LARGE_X, SMALL_Y))
- argsort() and topk():
They both return indices, which are specified by dtype=np.int64.
# create_2d_tensor is a helper (as used in MXNet's large tensor tests) that builds a 2D NDArray of the given size
b = create_2d_tensor(rows=LARGE_X, columns=SMALL_Y)
# argsort
s = nd.argsort(b, axis=0, is_ascend=False, dtype=np.int64)
# topk
k = nd.topk(b, k=10, axis=0, dtype=np.int64)
- index_copy():
Again, whenever we are passing indices as arguments and using a large tensor, the dtype of the indices must be int64.
x = mx.nd.zeros((LARGE_X, SMALL_Y))
t = mx.nd.arange(1, SMALL_Y + 1).reshape((1, SMALL_Y))
# explicitly specifying dtype of indices to np.int64
index = mx.nd.array([LARGE_X - 1], dtype="int64")
x = mx.nd.contrib.index_copy(x, index, t)
- one_hot():
Here again an array is used as indices that act as locations of the bits inside the large vector that need to be activated.
# a is the index array here whose dtype should be int64.
a = nd.array([1, (VLARGE_X - 1)], dtype=np.int64)
b = nd.one_hot(a, VLARGE_X)
What platforms and versions of MXNet are supported?
You can use MXNet with large tensor support in the following configuration:
MXNet built for CPU on Linux (Ubuntu or Amazon Linux), and only for the Python bindings. Custom wheels are provided with this configuration.
These flavors of MXNet are currently built with large tensor support:
- MXNet for linux-cpu
- MXNet for linux_cu100
Large tensor support only works for the forward pass. The backward pass is partially supported and not completely tested, so it is considered experimental at best.
Not supported:
- GPU and MKLDNN.
- Windows, ARM, or any operating system other than Ubuntu
- Any permutation of MXNet wheel that contains MKLDNN.
- Other language bindings like Scala, Java, R, and Julia.
Other known Issues:
- Randint operator is flaky: https://github.com/apache/mxnet/issues/16172
- dgemm operations using BLAS libraries currently don't support int64.
- linspace() is not supported.
- Symbolic reshape is not supported. Please see the following example.
a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
c = 2 * a + b
texec = c.bind(mx.cpu(), {'a': nd.arange(0, LARGE_X * 2, dtype='int64').reshape(2, LARGE_X), 'b' : nd.arange(0, LARGE_X * 2, dtype='int64').reshape(2, LARGE_X)})
new_shape = {'a': (1, 2 * LARGE_X), 'b': (1, 2 * LARGE_X)}
texec.reshape(allow_up_sizing=True, **new_shape)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/mxnet/python/mxnet/executor.py", line 449, in reshape
py_array('i', provided_arg_shape_data)),
OverflowError: signed integer is greater than maximum
Working DGL Example (dgl.ai)
The following is sample DGL code that works with int64 but not with int32.
import mxnet as mx
from mxnet import gluon
import dgl
import dgl.function as fn
import numpy as np
from scipy import sparse as spsp
num_nodes = 10000000
num_edges = 100000000
col1 = np.random.randint(0, num_nodes, size=(num_edges,))
print('create col1')
col2 = np.random.randint(0, num_nodes, size=(num_edges,))
print('create col2')
data = np.ones((num_edges,))
print('create data')
spm = spsp.coo_matrix((data, (col1, col2)), shape=(num_nodes, num_nodes))
print('create coo')
labels = mx.nd.random.randint(0, 10, shape=(num_nodes,))
g = dgl.DGLGraph(spm, readonly=True)
print('create DGLGraph')
g.ndata['h'] = mx.nd.random.uniform(shape=(num_nodes, 200))
print('create node data')
class node_update(gluon.Block):
    def __init__(self, out_feats):
        super(node_update, self).__init__()
        self.dense = gluon.nn.Dense(out_feats, 'relu')
        self.dropout = 0.5

    def forward(self, nodes):
        h = mx.nd.concat(nodes.data['h'], nodes.data['accum'], dim=1)
        h = self.dense(h)
        return {'h': mx.nd.Dropout(h, p=self.dropout)}
update_fn = node_update(200)
update_fn.initialize(ctx=mx.cpu())
g.update_all(fn.copy_src(src='h', out='m'), fn.sum(msg='m', out='accum'), update_fn)
print('update all')
loss_fcn = gluon.loss.SoftmaxCELoss()
loss = loss_fcn(g.ndata['h'], labels)
print('loss')
loss = loss.sum()
print(loss)
Performance Regression:
Roughly 40 operators have shown a performance regression in our preliminary analysis (Large Tensor Performance), as shown in the table below.
Operator | int32 (msec) | int64 (msec) | int64/int32 | int32+MKL (msec) | int64+MKL (msec) | (int64+MKL)/(int32+MKL) |
---|---|---|---|---|---|---|
topk | 12.81245198 | 42.2472195 | 329.74% | 12.728027 | 43.462353 | 341.47% |
argsort | 16.43896801 | 46.2231455 | 281.18% | 17.200311 | 46.7779985 | 271.96% |
sort | 16.57822751 | 46.5644815 | 280.88% | 16.401236 | 46.263803 | 282.08% |
flip | 0.221817521 | 0.535838 | 241.57% | 0.2123705 | 0.7950055 | 374.35% |
depth_to_space | 0.250976998 | 0.534083 | 212.80% | 0.2338155 | 0.631252 | 269.98% |
space_to_depth | 0.254336512 | 0.5368935 | 211.10% | 0.2334405 | 0.6343175 | 271.73% |
min_axis | 0.685826526 | 1.4393255 | 209.87% | 0.6266175 | 1.3538925 | 216.06% |
sum_axis | 0.720809505 | 1.5110635 | 209.63% | 0.6566265 | 0.8290575 | 126.26% |
nansum | 1.279337012 | 2.635434 | 206.00% | 1.227156 | 2.4305255 | 198.06% |
argmax | 4.765146994 | 9.682672 | 203.20% | 4.6576605 | 9.394067 | 201.69% |
swapaxes | 0.667943008 | 1.3544455 | 202.78% | 0.649036 | 1.8293235 | 281.85% |
argmin | 4.774890491 | 9.545651 | 199.91% | 4.666858 | 9.5194385 | 203.98% |
sum_axis | 0.540210982 | 1.0550705 | 195.31% | 0.500895 | 0.616179 | 123.02% |
max_axis | 0.117824005 | 0.226481 | 192.22% | 0.149085 | 0.224334 | 150.47% |
argmax_channel | 0.261897018 | 0.49573 | 189.28% | 0.251171 | 0.4814885 | 191.70% |
min_axis | 0.147698505 | 0.2675355 | 181.14% | 0.148424 | 0.2874105 | 193.64% |
nansum | 1.142132009 | 2.058077 | 180.20% | 1.042387 | 1.263102 | 121.17% |
min_axis | 0.56951947 | 1.020972 | 179.27% | 0.4722595 | 0.998179 | 211.36% |
min | 1.154684491 | 2.0446045 | 177.07% | 1.0534145 | 1.9723065 | 187.23% |
sum | 1.121753477 | 1.959272 | 174.66% | 0.9984095 | 1.213339 | 121.53% |
sum_axis | 0.158632494 | 0.2744115 | 172.99% | 0.1573735 | 0.2266315 | 144.01% |
nansum | 0.21418152 | 0.3661335 | 170.95% | 0.2162935 | 0.269517 | 124.61% |
random_normal | 1.229072484 | 2.093057 | 170.30% | 1.222785 | 2.095916 | 171.41% |
LeakyReLU | 0.344101485 | 0.582337 | 169.23% | 0.389167 | 0.7003465 | 179.96% |
nanprod | 1.273265516 | 2.095068 | 164.54% | 1.0906815 | 2.054369 | 188.36% |
nanprod | 0.203272473 | 0.32792 | 161.32% | 0.202548 | 0.3288335 | 162.35% |
sample_gamma | 8.079962019 | 12.7266385 | 157.51% | 12.4216245 | 12.7957475 | 103.01% |
sum | 0.21571602 | 0.3396875 | 157.47% | 0.1939995 | 0.262942 | 135.54% |
argmin | 0.086381478 | 0.1354795 | 156.84% | 0.0826235 | 0.134886 | 163.25% |
argmax | 0.08664903 | 0.135826 | 156.75% | 0.082693 | 0.1269225 | 153.49% |
sample_gamma | 7.712843508 | 12.0266355 | 155.93% | 11.8900915 | 12.143009 | 102.13% |
sample_exponential | 2.312778 | 3.5953945 | 155.46% | 3.0935085 | 3.5656265 | 115.26% |
prod | 0.203170988 | 0.3113865 | 153.26% | 0.180757 | 0.264523 | 146.34% |
random_uniform | 0.40893798 | 0.6240795 | 152.61% | 0.244613 | 0.6319695 | 258.35% |
min | 0.205482502 | 0.3122025 | 151.94% | 0.2023835 | 0.33234 | 164.21% |
random_negative_binomial | 3.919228504 | 5.919488 | 151.04% | 5.685851 | 6.0220735 | 105.91% |
max | 0.212521001 | 0.3130105 | 147.28% | 0.2039755 | 0.2956105 | 144.92% |
LeakyReLU | 2.813424013 | 4.1121625 | 146.16% | 2.719118 | 5.613753 | 206.45% |
mean | 0.242281501 | 0.344385 | 142.14% | 0.209396 | 0.313411 | 149.67% |
Deconvolution | 7.43279251 | 10.4240845 | 140.24% | 2.9548925 | 5.812926 | 196.72% |
abs | 0.273286481 | 0.38319 | 140.22% | 0.3711615 | 0.338064 | 91.08% |
arcsinh | 0.155792513 | 0.2090985 | 134.22% | 0.113365 | 0.1702855 | 150.21% |
sample_gamma | 0.137634983 | 0.1842455 | 133.87% | 0.1792825 | 0.172175 | 96.04% |
sort | 0.864107016 | 1.1560165 | 133.78% | 0.8239285 | 1.1454645 | 139.02% |
argsort | 0.847259507 | 1.1320885 | 133.62% | 0.842302 | 1.1179105 | 132.72% |
cosh | 0.129947497 | 0.1727415 | 132.93% | 0.1192565 | 0.1217325 | 102.08% |
random_randint | 0.822044531 | 1.085645 | 132.07% | 0.6036805 | 1.0953995 | 181.45% |
arctanh | 0.119817996 | 0.1576315 | 131.56% | 0.115616 | 0.111907 | 96.79% |
arccos | 0.185662502 | 0.2423095 | 130.51% | 0.238534 | 0.2351415 | 98.58% |
mean | 1.758513477 | 2.2908485 | 130.27% | 1.5868465 | 2.530801 | 159.49% |
erfinv | 0.142498524 | 0.184796 | 129.68% | 0.1529025 | 0.1538225 | 100.60% |
degrees | 0.12517249 | 0.1576175 | 125.92% | 0.1166425 | 0.1199775 | 102.86% |
sample_exponential | 0.07651851 | 0.0960485 | 125.52% | 0.0885775 | 0.095597 | 107.92% |
arctan | 0.120863522 | 0.1496115 | 123.79% | 0.1161245 | 0.17206 | 148.17% |
prod | 1.147695002 | 1.408007 | 122.68% | 1.0491025 | 1.4065515 | 134.07% |
fix | 0.073436997 | 0.089991 | 122.54% | 0.0390455 | 0.099307 | 254.34% |
exp | 0.047701993 | 0.058272 | 122.16% | 0.0397295 | 0.0506725 | 127.54% |