# NDArray

# NDArray: Vectorized Tensor Computations on CPUs and GPUs

`NDArray`

is the basic vectorized operation unit in MXNet for matrix and tensor computations.
Users can perform usual calculations as on an R"s array, but with two additional features:

Multiple devices: All operations can be run on various devices including CPUs and GPUs.

Automatic parallelization: All operations are automatically executed in parallel with each other.

## Create and Initialize

Let"s create `NDArray`

on either a GPU or a CPU:

```
require(mxnet)
```

```
## Loading required package: mxnet
## Loading required package: methods
```

```
a <- mx.nd.zeros(c(2, 3)) # create a 2-by-3 matrix on cpu
b <- mx.nd.zeros(c(2, 3), mx.cpu()) # create a 2-by-3 matrix on cpu
# c <- mx.nd.zeros(c(2, 3), mx.gpu(0)) # create a 2-by-3 matrix on gpu 0, if you have CUDA enabled.
```

Typically for CUDA-enabled devices, the device id of a GPU starts from 0. That's why we passed in 0 to the GPU id.

We can initialize an `NDArray`

object in various ways:

```
a <- mx.nd.ones(c(4, 4))
b <- mx.rnorm(c(4, 5))
c <- mx.nd.array(1:5)
```

To check the numbers in an `NDArray`

, we can simply run:

```
a <- mx.nd.ones(c(2, 3))
b <- as.array(a)
class(b)
```

```
## [1] "matrix"
```

```
b
```

```
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
```

## Performing Basic Operations

### Elemental-wise Operations

You can perform elemental-wise operations on `NDArray`

objects, as follows:

```
a <- mx.nd.ones(c(2, 4)) * 2
b <- mx.nd.ones(c(2, 4)) / 8
as.array(a)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 2 2 2 2
## [2,] 2 2 2 2
```

```
as.array(b)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 0.125 0.125 0.125 0.125
## [2,] 0.125 0.125 0.125 0.125
```

```
c <- a + b
as.array(c)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 2.125 2.125 2.125 2.125
## [2,] 2.125 2.125 2.125 2.125
```

```
d <- c / a - 5
as.array(d)
```

```
## [,1] [,2] [,3] [,4]
## [1,] -3.9375 -3.9375 -3.9375 -3.9375
## [2,] -3.9375 -3.9375 -3.9375 -3.9375
```

If two `NDArray`

s are located on different devices, we need to explicitly move them to the same one. For instance:

```
a <- mx.nd.ones(c(2, 3)) * 2
b <- mx.nd.ones(c(2, 3), mx.gpu()) / 8
c <- mx.nd.copyto(a, mx.gpu()) * b
as.array(c)
```

### Loading and Saving

You can save a list of `NDArray`

object to your disk with `mx.nd.save`

:

```
a <- mx.nd.ones(c(2, 3))
mx.nd.save(list(a), "temp.ndarray")
```

You can load it back easily:

```
a <- mx.nd.load("temp.ndarray")
as.array(a[[1]])
```

```
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
```

We can directly save data to and load it from a distributed file system, such as Amazon S3 and HDFS:

```
mx.nd.save(list(a), "s3://mybucket/mydata.bin")
mx.nd.save(list(a), "hdfs///users/myname/mydata.bin")
```

## Automatic Parallelization

`NDArray`

can automatically execute operations in parallel. Automatic parallelization is useful when
using multiple resources, such as CPU cards, GPU cards, and CPU-to-GPU memory bandwidth.

For example, if we write `a <- a + 1`

followed by `b <- b + 1`

, and `a`

is on a CPU and
`b`

is on a GPU, executing them in parallel improves
efficiency. Furthermore, because copying data between CPUs and GPUs are also expensive, running in parallel with other computations further increases efficiency.

It's hard to find the code that can be executed in parallel by eye. In the
following example, `a <- a + 1`

and `c <- c * 3`

can be executed in parallel, but `a <- a + 1`

and
`b <- b * 3`

should be in sequential.

```
a <- mx.nd.ones(c(2,3))
b <- a
c <- mx.nd.copyto(a, mx.cpu())
a <- a + 1
b <- b * 3
c <- c * 3
```

Luckily, MXNet can automatically resolve the dependencies and execute operations in parallel accurately. This allows us to write our program assuming there is only a single thread. MXNet will automatically dispatch the program to multiple devices.

MXNet achieves this with lazy evaluation. Each operation is issued to an
internal engine, and then returned. For example, if we run `a <- a + 1`

, it
returns immediately after pushing the plus operator to the engine. This
asynchronous processing allows us to push more operators to the engine. It determines
the read and write dependencies and the best way to execute them in
parallel.

The actual computations are finished, allowing us to copy the results someplace else, such as `as.array(a)`

or `mx.nd.save(a, "temp.dat")`

. To write highly parallelized codes, we only need to postpone when we need
the results.