Gluon Neural Network Layers

Overview

This document lists the neural network blocks in Gluon:

Basic Layers

Dense Just your regular densely-connected NN layer.
Dropout Applies Dropout to the input.
BatchNorm Batch normalization layer (Ioffe and Szegedy, 2015).
InstanceNorm Applies instance normalization to the n-dimensional input array.
LayerNorm Applies layer normalization to the n-dimensional input array.
Embedding Turns non-negative integers (indexes/tokens) into dense vectors of fixed size.
Flatten Flattens the input to two dimensions.
Lambda Wraps an operator or an expression as a Block object.
HybridLambda Wraps an operator or an expression as a HybridBlock object.

Convolutional Layers

Conv1D 1D convolution layer (e.g. temporal convolution).
Conv2D 2D convolution layer (e.g. spatial convolution over images).
Conv3D 3D convolution layer (e.g. spatial convolution over volumes).
Conv1DTranspose Transposed 1D convolution layer (sometimes called Deconvolution).
Conv2DTranspose Transposed 2D convolution layer (sometimes called Deconvolution).
Conv3DTranspose Transposed 3D convolution layer (sometimes called Deconvolution).

Pooling Layers

MaxPool1D Max pooling operation for one dimensional data.
MaxPool2D Max pooling operation for two dimensional (spatial) data.
MaxPool3D Max pooling operation for 3D data (spatial or spatio-temporal).
AvgPool1D Average pooling operation for temporal data.
AvgPool2D Average pooling operation for spatial data.
AvgPool3D Average pooling operation for 3D data (spatial or spatio-temporal).
GlobalMaxPool1D Global max pooling operation for one dimensional (temporal) data.
GlobalMaxPool2D Global max pooling operation for two dimensional (spatial) data.
GlobalMaxPool3D Global max pooling operation for 3D data (spatial or spatio-temporal).
GlobalAvgPool1D Global average pooling operation for temporal data.
GlobalAvgPool2D Global average pooling operation for spatial data.
GlobalAvgPool3D Global average pooling operation for 3D data (spatial or spatio-temporal).
ReflectionPad2D Pads the input tensor using the reflection of the input boundary.

Activation Layers

Activation Applies an activation function to input.
LeakyReLU Leaky version of a Rectified Linear Unit.
PReLU Parametric leaky version of a Rectified Linear Unit.
ELU Exponential Linear Unit (ELU).
GELU Gaussian Error Linear Unit (GELU).
SELU Scaled Exponential Linear Unit (SELU).
Swish Swish activation function.

API Reference

Neural network layers.

class mxnet.gluon.nn.Activation(activation, **kwargs)[source]

Applies an activation function to input.

Parameters: activation (str) – Name of activation function to use. See Activation() for available choices.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
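
Example (illustrative sketch, not part of the upstream docstring): applying a named activation to an NDArray.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> act = nn.Activation('relu')            # no parameters, so no initialize() is needed
>>> x = mx.nd.array([[-1.0, 0.0, 2.0]])
>>> act(x)
[[0. 0. 2.]]
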
class mxnet.gluon.nn.AvgPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, count_include_pad=True, **kwargs)[source]

Average pooling operation for temporal data.

Parameters:
  • pool_size (int) – Size of the average pooling windows.
  • strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCW') – Dimension ordering of data and out (‘NCW’ or ‘NWC’). ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. padding is applied on ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
  • count_include_pad (bool, default True) – When False, padding elements are excluded when computing the average value.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, out_width) when layout is NCW. out_width is calculated as:

    out_width = floor((width+2*padding-pool_size)/strides)+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.
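
Example (illustrative sketch, not part of the upstream docstring): with width=8, pool_size=2 and strides=2, the formula above gives out_width = floor((8-2)/2)+1 = 4.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> pool = nn.AvgPool1D(pool_size=2, strides=2)
>>> x = mx.nd.random.uniform(shape=(1, 3, 8))    # (batch_size, channels, width)
>>> pool(x).shape
(1, 3, 4)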

class mxnet.gluon.nn.AvgPool2D(pool_size=(2, 2), strides=None, padding=0, ceil_mode=False, layout='NCHW', count_include_pad=True, **kwargs)[source]

Average pooling operation for spatial data.

Parameters:
  • pool_size (int or list/tuple of 2 ints,) – Size of the average pooling windows.
  • strides (int, list/tuple of 2 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 2 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCHW') – Dimension ordering of data and out (‘NCHW’ or ‘NHWC’). ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
  • count_include_pad (bool, default True) – When False, padding elements are excluded when computing the average value.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, out_height, out_width) when layout is NCHW. out_height and out_width are calculated as:

    out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
    out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.AvgPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', count_include_pad=True, **kwargs)[source]

Average pooling operation for 3D data (spatial or spatio-temporal).

Parameters:
  • pool_size (int or list/tuple of 3 ints,) – Size of the average pooling windows.
  • strides (int, list/tuple of 3 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 3 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCDHW') – Dimension ordering of data and out (‘NCDHW’ or ‘NDHWC’). ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
  • count_include_pad (bool, default True) – When False, padding elements are excluded when computing the average value.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, out_depth, out_height, out_width) when layout is NCDHW. out_depth, out_height and out_width are calculated as:

    out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
    out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
    out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.BatchNorm(axis=1, momentum=0.9, epsilon=1e-05, center=True, scale=True, use_global_stats=False, beta_initializer='zeros', gamma_initializer='ones', running_mean_initializer='zeros', running_variance_initializer='ones', in_channels=0, **kwargs)[source]

Batch normalization layer (Ioffe and Szegedy, 2015). Normalizes the input at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

Parameters:
  • axis (int, default 1) – The axis that should be normalized. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in BatchNorm. If layout=’NHWC’, then set axis=3.
  • momentum (float, default 0.9) – Momentum for the moving average.
  • epsilon (float, default 1e-5) – Small float added to variance to avoid dividing by zero.
  • center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
  • scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
  • use_global_stats (bool, default False) – If True, use global moving statistics instead of local batch-norm. This will force change batch-norm into a scale shift operator. If False, use local batch-norm.
  • beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
  • gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
  • running_mean_initializer (str or Initializer, default ‘zeros’) – Initializer for the running mean.
  • running_variance_initializer (str or Initializer, default ‘ones’) – Initializer for the running variance.
  • in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
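
Example (illustrative sketch, not part of the upstream docstring): in_channels is left unspecified here, so it is inferred on the first forward call.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> bn = nn.BatchNorm()                          # axis=1 matches the 'C' axis of NCHW data
>>> bn.initialize()
>>> x = mx.nd.random.uniform(shape=(2, 3, 4, 4))
>>> bn(x).shape
(2, 3, 4, 4)
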
class mxnet.gluon.nn.Conv1D(channels, kernel_size, strides=1, padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

1D convolution layer (e.g. temporal convolution).

This layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 1 int,) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 1 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • dilation (int or tuple/list of 1 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Only supports ‘NCW’ layout for now. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, out_width) when layout is NCW. out_width is calculated as:

    out_width = floor((width+2*padding-dilation*(kernel_size-1)-1)/stride)+1
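
Example (illustrative sketch, not part of the upstream docstring): with width=100, kernel_size=3, padding=0 and stride 1, out_width = floor((100-2-1)/1)+1 = 98.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> conv = nn.Conv1D(channels=16, kernel_size=3, activation='relu')
>>> conv.initialize()                            # in_channels inferred from the first input
>>> x = mx.nd.random.uniform(shape=(4, 8, 100))  # (batch_size, in_channels, width)
>>> conv(x).shape
(4, 16, 98)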
    
class mxnet.gluon.nn.Conv1DTranspose(channels, kernel_size, strides=1, padding=0, output_padding=0, dilation=1, groups=1, layout='NCW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

Transposed 1D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 1 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 1 int) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 1 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • output_padding (int or a tuple/list of 1 int) – Controls the amount of implicit zero-paddings on both sides of the output for output_padding number of points for each dimension.
  • dilation (int or tuple/list of 1 int) – Controls the spacing between the kernel points; also known as the a trous algorithm
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCW') – Dimension ordering of data and weight. Only supports ‘NCW’ layout for now. ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Convolution is applied on the ‘W’ dimension.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, out_width) when layout is NCW. out_width is calculated as:

    out_width = (width-1)*strides-2*padding+kernel_size+output_padding
    
class mxnet.gluon.nn.Conv2D(channels, kernel_size, strides=(1, 1), padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

2D convolution layer (e.g. spatial convolution over images).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 2 int,) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 2 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • dilation (int or tuple/list of 2 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Only supports ‘NCHW’ and ‘NHWC’ layout for now. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, out_height, out_width) when layout is NCHW. out_height and out_width are calculated as:

    out_height = floor((height+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
    out_width = floor((width+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1
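
Example (illustrative sketch, not part of the upstream docstring): kernel_size=3 with padding=1 and stride 1 preserves the spatial size, since floor((32+2-2-1)/1)+1 = 32.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> conv = nn.Conv2D(channels=8, kernel_size=3, padding=1)
>>> conv.initialize()
>>> x = mx.nd.random.uniform(shape=(1, 3, 32, 32))   # NCHW
>>> conv(x).shape
(1, 8, 32, 32)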
    
class mxnet.gluon.nn.Conv2DTranspose(channels, kernel_size, strides=(1, 1), padding=(0, 0), output_padding=(0, 0), dilation=(1, 1), groups=1, layout='NCHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

Transposed 2D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 2 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 2 int) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 2 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • output_padding (int or a tuple/list of 2 int) – Controls the amount of implicit zero-paddings on both sides of the output for output_padding number of points for each dimension.
  • dilation (int or tuple/list of 2 int) – Controls the spacing between the kernel points; also known as the a trous algorithm
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCHW') – Dimension ordering of data and weight. Only supports ‘NCHW’ and ‘NHWC’ layout for now. ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Convolution is applied on the ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, out_height, out_width) when layout is NCHW. out_height and out_width are calculated as:

    out_height = (height-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
    out_width = (width-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
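
Example (illustrative sketch, not part of the upstream docstring): a common 2x upsampling configuration; (16-1)*2 - 2*1 + 4 + 0 = 32 doubles each spatial dimension.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> deconv = nn.Conv2DTranspose(channels=3, kernel_size=4, strides=2, padding=1)
>>> deconv.initialize()
>>> x = mx.nd.random.uniform(shape=(1, 8, 16, 16))
>>> deconv(x).shape
(1, 3, 32, 32)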
    
class mxnet.gluon.nn.Conv3D(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

3D convolution layer (e.g. spatial convolution over volumes).

This layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. If use_bias is True, a bias vector is created and added to the outputs. Finally, if activation is not None, it is applied to the outputs as well.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 3 int,) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 3 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • dilation (int or tuple/list of 3 int) – Specifies the dilation rate to use for dilated convolution.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Only supports ‘NCDHW’ and ‘NDHWC’ layout for now. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, out_depth, out_height, out_width) when layout is NCDHW. out_depth, out_height and out_width are calculated as:

    out_depth = floor((depth+2*padding[0]-dilation[0]*(kernel_size[0]-1)-1)/stride[0])+1
    out_height = floor((height+2*padding[1]-dilation[1]*(kernel_size[1]-1)-1)/stride[1])+1
    out_width = floor((width+2*padding[2]-dilation[2]*(kernel_size[2]-1)-1)/stride[2])+1
    
class mxnet.gluon.nn.Conv3DTranspose(channels, kernel_size, strides=(1, 1, 1), padding=(0, 0, 0), output_padding=(0, 0, 0), dilation=(1, 1, 1), groups=1, layout='NCDHW', activation=None, use_bias=True, weight_initializer=None, bias_initializer='zeros', in_channels=0, **kwargs)[source]

Transposed 3D convolution layer (sometimes called Deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution.

If in_channels is not specified, Parameter initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.

Parameters:
  • channels (int) – The dimensionality of the output space, i.e. the number of output channels (filters) in the convolution.
  • kernel_size (int or tuple/list of 3 int) – Specifies the dimensions of the convolution window.
  • strides (int or tuple/list of 3 int) – Specify the strides of the convolution.
  • padding (int or a tuple/list of 3 int,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
  • output_padding (int or a tuple/list of 3 int) – Controls the amount of implicit zero-paddings on both sides of the output for output_padding number of points for each dimension.
  • dilation (int or tuple/list of 3 int) – Controls the spacing between the kernel points; also known as the a trous algorithm.
  • groups (int) – Controls the connections between inputs and outputs. At groups=1, all inputs are convolved to all outputs. At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
  • layout (str, default 'NCDHW') – Dimension ordering of data and weight. Only supports ‘NCDHW’ and ‘NDHWC’ layout for now. ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Convolution is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
  • in_channels (int, default 0) – The number of input channels to this layer. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
  • activation (str) – Activation function to use. See Activation(). If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool) – Whether the layer uses a bias vector.
  • weight_initializer (str or Initializer) – Initializer for the weight weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, out_depth, out_height, out_width) when layout is NCDHW. out_depth, out_height and out_width are calculated as:

    out_depth = (depth-1)*strides[0]-2*padding[0]+kernel_size[0]+output_padding[0]
    out_height = (height-1)*strides[1]-2*padding[1]+kernel_size[1]+output_padding[1]
    out_width = (width-1)*strides[2]-2*padding[2]+kernel_size[2]+output_padding[2]
    
class mxnet.gluon.nn.Dense(units, activation=None, use_bias=True, flatten=True, dtype='float32', weight_initializer=None, bias_initializer='zeros', in_units=0, **kwargs)[source]

Just your regular densely-connected NN layer.

Dense implements the operation: output = activation(dot(input, weight) + bias) where activation is the element-wise activation function passed as the activation argument, weight is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Note: the input must be a tensor with rank 2. Use flatten to convert it to rank 2 manually if necessary.

Parameters:
  • units (int) – Dimensionality of the output space.
  • activation (str) – Activation function to use. See help on Activation layer. If you don’t specify anything, no activation is applied (ie. “linear” activation: a(x) = x).
  • use_bias (bool, default True) – Whether the layer uses a bias vector.
  • flatten (bool, default True) – Whether the input tensor should be flattened. If true, all but the first axis of input data are collapsed together. If false, all but the last axis of input data are kept the same, and the transformation applies on the last axis.
  • dtype (str or np.dtype, default 'float32') – Data type of the weights and output.
  • weight_initializer (str or Initializer) – Initializer for the kernel weights matrix.
  • bias_initializer (str or Initializer) – Initializer for the bias vector.
  • in_units (int, optional) – Size of the input data. If not specified, initialization will be deferred to the first time forward is called and in_units will be inferred from the shape of input data.
  • prefix (str or None) – See document of Block.
  • params (ParameterDict or None) – See document of Block.
Inputs:
  • data: if flatten is True, data should be a tensor with shape (batch_size, x1, x2, ..., xn), where x1 * x2 * ... * xn is equal to in_units. If flatten is False, data should have shape (x1, x2, ..., xn, in_units).
Outputs:
  • out: if flatten is True, out will be a tensor with shape (batch_size, units). If flatten is False, out will have shape (x1, x2, ..., xn, units).
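
Example (illustrative sketch, not part of the upstream docstring): with flatten=True a higher-rank input is collapsed to (batch_size, units); with flatten=False only the last axis is transformed.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> dense = nn.Dense(10)                         # in_units inferred on the first call
>>> dense.initialize()
>>> dense(mx.nd.random.uniform(shape=(4, 2, 3, 5))).shape
(4, 10)
>>> seq_dense = nn.Dense(10, flatten=False)
>>> seq_dense.initialize()
>>> seq_dense(mx.nd.random.uniform(shape=(4, 7, 5))).shape
(4, 7, 10)
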
class mxnet.gluon.nn.Dropout(rate, axes=(), **kwargs)[source]

Applies Dropout to the input.

Dropout consists of randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.

Parameters:
  • rate (float) – Fraction of the input units to drop. Must be a number between 0 and 1.
  • axes (tuple of int, default ()) – The axes on which dropout mask is shared. If empty, regular dropout is applied.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.

References

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
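
Example (illustrative sketch, not part of the upstream docstring): dropout is only applied in training mode, e.g. inside autograd.record(); outside of it the layer is an identity.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> drop = nn.Dropout(0.5)
>>> x = mx.nd.ones((2, 4))
>>> drop(x)                       # inference mode: identity
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
>>> with mx.autograd.record():    # training mode: roughly half the units are zeroed and the rest rescaled
...     y = drop(x)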

class mxnet.gluon.nn.ELU(alpha=1.0, **kwargs)[source]
Exponential Linear Unit (ELU)
“Fast and Accurate Deep Network Learning by Exponential Linear Units”, Clevert et al., 2016, https://arxiv.org/abs/1511.07289. Published as a conference paper at ICLR 2016.
Parameters: alpha (float) – The alpha parameter as described by Clevert et al., 2016.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
class mxnet.gluon.nn.Embedding(input_dim, output_dim, dtype='float32', weight_initializer=None, sparse_grad=False, **kwargs)[source]

Turns non-negative integers (indexes/tokens) into dense vectors of fixed size, e.g. [4, 20] -> [[0.25, 0.1], [0.6, -0.2]].

Note: if sparse_grad is set to True, the gradient w.r.t. weight will be sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad and Adam. By default lazy updates are turned on, which may perform differently from standard updates. For more details, please check the Optimization API at: /api/python/optimization/optimization.html

Parameters:
  • input_dim (int) – Size of the vocabulary, i.e. maximum integer index + 1.
  • output_dim (int) – Dimension of the dense embedding.
  • dtype (str or np.dtype, default 'float32') – Data type of output embeddings.
  • weight_initializer (Initializer) – Initializer for the embeddings matrix.
  • sparse_grad (bool) – If True, gradient w.r.t. weight will be a ‘row_sparse’ NDArray.
Inputs:
  • data: (N-1)-D tensor with shape: (x1, x2, ..., xN-1).
Outputs:
  • out: N-D tensor with shape: (x1, x2, ..., xN-1, output_dim).
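
Example (illustrative sketch, not part of the upstream docstring): mapping a (2, 3) batch of token indices to (2, 3, output_dim) vectors.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> embed = nn.Embedding(input_dim=1000, output_dim=50)
>>> embed.initialize()
>>> tokens = mx.nd.array([[4, 20, 7], [1, 0, 999]])
>>> embed(tokens).shape
(2, 3, 50)
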
class mxnet.gluon.nn.Flatten(**kwargs)[source]

Flattens the input to two dimensions.

Inputs:
  • data: input tensor with arbitrary shape (N, x1, x2, ..., xn)
Output:
  • out: 2D tensor with shape: (N, x1*x2*...*xn).
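
Example (illustrative sketch, not part of the upstream docstring): a (2, 3, 4, 5) input flattens to (2, 60).

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> flatten = nn.Flatten()
>>> flatten(mx.nd.random.uniform(shape=(2, 3, 4, 5))).shape
(2, 60)
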
class mxnet.gluon.nn.GELU(**kwargs)[source]
Gaussian Error Linear Unit (GELU)
“Gaussian Error Linear Units (GELUs)”, Hendrycks et al., 2016. https://arxiv.org/abs/1606.08415
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
class mxnet.gluon.nn.GlobalAvgPool1D(layout='NCW', **kwargs)[source]

Global average pooling operation for temporal data.

Parameters: layout (str, default 'NCW') – Dimension ordering of data and out (‘NCW’ or ‘NWC’). ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the ‘W’ dimension.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, 1).
class mxnet.gluon.nn.GlobalAvgPool2D(layout='NCHW', **kwargs)[source]

Global average pooling operation for spatial data.

Parameters: layout (str, default 'NCHW') – Dimension ordering of data and out (‘NCHW’ or ‘NHWC’). ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, 1, 1) when layout is NCHW.
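
Example (illustrative sketch, not part of the upstream docstring): global pooling reduces each feature map to a single value, a common replacement for Flatten before a final Dense classifier.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> gap = nn.GlobalAvgPool2D()
>>> x = mx.nd.random.uniform(shape=(1, 64, 7, 7))
>>> gap(x).shape
(1, 64, 1, 1)
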
class mxnet.gluon.nn.GlobalAvgPool3D(layout='NCDHW', **kwargs)[source]

Global average pooling operation for 3D data (spatial or spatio-temporal).

Parameters: layout (str, default 'NCDHW') – Dimension ordering of data and out (‘NCDHW’ or ‘NDHWC’). ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, 1, 1, 1) when layout is NCDHW.
class mxnet.gluon.nn.GlobalMaxPool1D(layout='NCW', **kwargs)[source]

Global max pooling operation for one dimensional (temporal) data.

Parameters: layout (str, default 'NCW') – Dimension ordering of data and out (‘NCW’ or ‘NWC’). ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, 1) when layout is NCW.
class mxnet.gluon.nn.GlobalMaxPool2D(layout='NCHW', **kwargs)[source]

Global max pooling operation for two dimensional (spatial) data.

Parameters: layout (str, default 'NCHW') – Dimension ordering of data and out (‘NCHW’ or ‘NHWC’). ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. Pooling is applied on the ‘H’ and ‘W’ dimensions.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, 1, 1) when layout is NCHW.
class mxnet.gluon.nn.GlobalMaxPool3D(layout='NCDHW', **kwargs)[source]

Global max pooling operation for 3D data (spatial or spatio-temporal).

Parameters: layout (str, default 'NCDHW') – Dimension ordering of data and out (‘NCDHW’ or ‘NDHWC’). ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. Pooling is applied on the ‘D’, ‘H’ and ‘W’ dimensions.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, 1, 1, 1) when layout is NCDHW.
class mxnet.gluon.nn.HybridLambda(function, prefix=None)[source]

Wraps an operator or an expression as a HybridBlock object.

Parameters:
  • function (str or function) –

    Function used in lambda must be one of the following:

    1) The name of an operator that is available in both symbol and ndarray. For example:

      block = HybridLambda('tanh')

    2) A function that conforms to def function(F, data, *args). For example:

      block = HybridLambda(lambda F, x: F.LeakyReLU(x, slope=0.1))

Inputs:
  • *args: one or more input data. The first argument must be a symbol or an ndarray. Their shapes depend on the function.
Outputs:
  • outputs: one or more output data. Their shapes depend on the function.
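
Example (illustrative sketch, not part of the upstream docstring): a HybridLambda composed into a HybridSequential stays hybridizable because the wrapped function only uses the F namespace.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> net = nn.HybridSequential()
>>> net.add(nn.Dense(4),
...         nn.HybridLambda(lambda F, x: F.relu(x)))
>>> net.initialize()
>>> net.hybridize()
>>> net(mx.nd.random.uniform(shape=(2, 8))).shape
(2, 4)
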
class mxnet.gluon.nn.InstanceNorm(axis=1, epsilon=1e-05, center=True, scale=False, beta_initializer='zeros', gamma_initializer='ones', in_channels=0, **kwargs)[source]

Applies instance normalization to the n-dimensional input array. This operator takes an n-dimensional input array where (n>2) and normalizes the input using the following formula:

\[\bar{C} = \{i \mid i \neq 0, i \neq \text{axis}\}\]
\[out = \frac{x - \mathrm{mean}[data, \bar{C}]}{\sqrt{\mathrm{Var}[data, \bar{C}]} + \epsilon} * gamma + beta\]
Parameters:
  • axis (int, default 1) – The axis that will be excluded in the normalization process. This is typically the channels (C) axis. For instance, after a Conv2D layer with layout=’NCHW’, set axis=1 in InstanceNorm. If layout=’NHWC’, then set axis=3. Data will be normalized along axes excluding the first axis and the axis given.
  • epsilon (float, default 1e-5) – Small float added to variance to avoid dividing by zero.
  • center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
  • scale (bool, default False) – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
  • beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
  • gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
  • in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.

References

Instance Normalization: The Missing Ingredient for Fast Stylization

Examples

>>> # Input of shape (2,1,2)
>>> x = mx.nd.array([[[ 1.1,  2.2]],
...                 [[ 3.3,  4.4]]])
>>> # Instance normalization is calculated with the above formula
>>> layer = InstanceNorm()
>>> layer.initialize(ctx=mx.cpu(0))
>>> layer(x)
[[[-0.99998355  0.99998331]]
 [[-0.99998319  0.99998361]]]

class mxnet.gluon.nn.Lambda(function, prefix=None)[source]

Wraps an operator or an expression as a Block object.

Parameters:
  • function (str or function) –

    Function used in lambda must be one of the following:

    1) The name of an operator that is available in ndarray. For example:

      block = Lambda('tanh')

    2) A function that conforms to def function(*args). For example:

      block = Lambda(lambda x: nd.LeakyReLU(x, slope=0.1))

Inputs:
  • *args: one or more input data. Their shapes depend on the function.
Outputs:
  • outputs: one or more output data. Their shapes depend on the function.
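
Example (illustrative sketch, not part of the upstream docstring): wrapping an imperative-only expression as a Block so it can be used inside a Sequential.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> net = nn.Sequential()
>>> net.add(nn.Dense(4),
...         nn.Lambda(lambda x: x * x))          # any ndarray expression works here
>>> net.initialize()
>>> net(mx.nd.random.uniform(shape=(2, 8))).shape
(2, 4)
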
class mxnet.gluon.nn.LayerNorm(axis=-1, epsilon=1e-05, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', in_channels=0, prefix=None, params=None)[source]

Applies layer normalization to the n-dimensional input array. This operator takes an n-dimensional input array and normalizes the input using the given axis:

\[out = \frac{x - mean[data, axis]}{ \sqrt{Var[data, axis] + \epsilon}} * gamma + beta\]
Parameters:
  • axis (int, default -1) – The axis that should be normalized. This is typically the axis of the channels.
  • epsilon (float, default 1e-5) – Small float added to variance to avoid dividing by zero.
  • center (bool, default True) – If True, add offset of beta to normalized tensor. If False, beta is ignored.
  • scale (bool, default True) – If True, multiply by gamma. If False, gamma is not used.
  • beta_initializer (str or Initializer, default ‘zeros’) – Initializer for the beta weight.
  • gamma_initializer (str or Initializer, default ‘ones’) – Initializer for the gamma weight.
  • in_channels (int, default 0) – Number of channels (feature maps) in input data. If not specified, initialization will be deferred to the first time forward is called and in_channels will be inferred from the shape of input data.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.

References

Layer Normalization

Examples

>>> # Input of shape (2, 5)
>>> x = mx.nd.array([[1, 2, 3, 4, 5], [1, 1, 2, 2, 2]])
>>> # Layer normalization is calculated with the above formula
>>> layer = LayerNorm()
>>> layer.initialize(ctx=mx.cpu(0))
>>> layer(x)
[[-1.41421    -0.707105    0.          0.707105    1.41421   ]
 [-1.2247195  -1.2247195   0.81647956  0.81647956  0.81647956]]

class mxnet.gluon.nn.LeakyReLU(alpha, **kwargs)[source]

Leaky version of a Rectified Linear Unit.

It allows a small gradient when the unit is not active

\[f(x) = \begin{cases} \alpha x & x < 0 \\ x & x \geq 0 \end{cases}\]
Parameters: alpha (float) – Slope coefficient for the negative half axis. Must be >= 0.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
class mxnet.gluon.nn.MaxPool1D(pool_size=2, strides=None, padding=0, layout='NCW', ceil_mode=False, **kwargs)[source]

Max pooling operation for one dimensional data.

Parameters:
  • pool_size (int) – Size of the max pooling windows.
  • strides (int, or None) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCW') – Dimension ordering of data and out (‘NCW’ or ‘NWC’). ‘N’, ‘C’, ‘W’ stands for batch, channel, and width (time) dimensions respectively. Pooling is applied on the W dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Inputs:
  • data: 3D input tensor with shape (batch_size, in_channels, width) when layout is NCW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 3D output tensor with shape (batch_size, channels, out_width) when layout is NCW. out_width is calculated as:

    out_width = floor((width+2*padding-pool_size)/strides)+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.MaxPool2D(pool_size=(2, 2), strides=None, padding=0, layout='NCHW', ceil_mode=False, **kwargs)[source]

Max pooling operation for two dimensional (spatial) data.

Parameters:
  • pool_size (int or list/tuple of 2 ints,) – Size of the max pooling windows.
  • strides (int, list/tuple of 2 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 2 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCHW') – Dimension ordering of data and out (‘NCHW’ or ‘NHWC’). ‘N’, ‘C’, ‘H’, ‘W’ stands for batch, channel, height, and width dimensions respectively. padding is applied on ‘H’ and ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Inputs:
  • data: 4D input tensor with shape (batch_size, in_channels, height, width) when layout is NCHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 4D output tensor with shape (batch_size, channels, out_height, out_width) when layout is NCHW. out_height and out_width are calculated as:

    out_height = floor((height+2*padding[0]-pool_size[0])/strides[0])+1
    out_width = floor((width+2*padding[1]-pool_size[1])/strides[1])+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.
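
Example (illustrative sketch, not part of the upstream docstring): the effect of ceil_mode on the output size; with height=width=224, pool_size=3 and strides=2, floor gives 111 while ceil gives 112.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> x = mx.nd.random.uniform(shape=(1, 1, 224, 224))
>>> nn.MaxPool2D(pool_size=3, strides=2)(x).shape
(1, 1, 111, 111)
>>> nn.MaxPool2D(pool_size=3, strides=2, ceil_mode=True)(x).shape
(1, 1, 112, 112)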

class mxnet.gluon.nn.MaxPool3D(pool_size=(2, 2, 2), strides=None, padding=0, ceil_mode=False, layout='NCDHW', **kwargs)[source]

Max pooling operation for 3D data (spatial or spatio-temporal).

Parameters:
  • pool_size (int or list/tuple of 3 ints,) – Size of the max pooling windows.
  • strides (int, list/tuple of 3 ints, or None.) – Factor by which to downscale. E.g. 2 will halve the input size. If None, it will default to pool_size.
  • padding (int or list/tuple of 3 ints,) – If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points.
  • layout (str, default 'NCDHW') – Dimension ordering of data and out (‘NCDHW’ or ‘NDHWC’). ‘N’, ‘C’, ‘H’, ‘W’, ‘D’ stands for batch, channel, height, width and depth dimensions respectively. padding is applied on ‘D’, ‘H’ and ‘W’ dimension.
  • ceil_mode (bool, default False) – When True, will use ceil instead of floor to compute the output shape.
Inputs:
  • data: 5D input tensor with shape (batch_size, in_channels, depth, height, width) when layout is NCDHW. For other layouts shape is permuted accordingly.
Outputs:
  • out: 5D output tensor with shape (batch_size, channels, out_depth, out_height, out_width) when layout is NCDHW. out_depth, out_height and out_width are calculated as:

    out_depth = floor((depth+2*padding[0]-pool_size[0])/strides[0])+1
    out_height = floor((height+2*padding[1]-pool_size[1])/strides[1])+1
    out_width = floor((width+2*padding[2]-pool_size[2])/strides[2])+1
    

    When ceil_mode is True, ceil will be used instead of floor in this equation.

class mxnet.gluon.nn.PReLU(alpha_initializer=Constant(0.25), **kwargs)[source]

Parametric leaky version of a Rectified Linear Unit, from the “Delving Deep into Rectifiers” paper: https://arxiv.org/abs/1502.01852

It learns the slope of the negative part of the activation:

\[f(x) = \begin{cases} \alpha x & x < 0 \\ x & x \geq 0 \end{cases}\]

where alpha is a learned parameter.

Parameters: alpha_initializer (Initializer) – Initializer for the learnable alpha (slope) parameter.
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
class mxnet.gluon.nn.ReflectionPad2D(padding=0, **kwargs)[source]

Pads the input tensor using the reflection of the input boundary.

Parameters: padding (int) – An integer padding size.
Inputs:
  • data: input tensor with the shape \((N, C, H_{in}, W_{in})\).
Outputs:
  • out: output tensor with the shape \((N, C, H_{out}, W_{out})\), where

    \[H_{out} = H_{in} + 2 \cdot padding\]
    \[W_{out} = W_{in} + 2 \cdot padding\]

Examples

>>> m = nn.ReflectionPad2D(3)
>>> input = mx.nd.random.normal(shape=(16, 3, 224, 224))
>>> output = m(input)
class mxnet.gluon.nn.SELU(**kwargs)[source]
Scaled Exponential Linear Unit (SELU)
“Self-Normalizing Neural Networks”, Klambauer et al., 2017. https://arxiv.org/abs/1706.02515
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
class mxnet.gluon.nn.Swish(beta=1.0, **kwargs)[source]
Swish activation function, from “Searching for Activation Functions” (Ramachandran et al., 2017): https://arxiv.org/pdf/1710.05941.pdf
Parameters: beta (float) – The beta coefficient in swish(x) = x * sigmoid(beta * x).
Inputs:
  • data: input tensor with arbitrary shape.
Outputs:
  • out: output tensor with the same shape as data.
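
Example (illustrative sketch, not part of the upstream docstring): the layer output matches x * sigmoid(beta * x) computed directly.

>>> import mxnet as mx
>>> from mxnet.gluon import nn
>>> swish = nn.Swish(beta=1.0)
>>> x = mx.nd.array([-1.0, 0.0, 2.0])
>>> y = swish(x)
>>> ref = x * mx.nd.sigmoid(1.0 * x)             # the same formula, computed by hand
>>> mx.nd.abs(y - ref).max().asscalar() < 1e-6
True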