Optimization: initialize and update weights

Overview

This document summaries the APIs used to initialize and update the model weights during training

mxnet.initializer Weight initializer.
mxnet.optimizer Optimizer API of MXNet.
mxnet.lr_scheduler Scheduling learning rate.

and how to develop a new optimization algorithm in MXNet.

Assume there there is a pre-defined Symbol and a Module is created for it

>>> data = mx.symbol.Variable('data')
>>> label = mx.symbol.Variable('softmax_label')
>>> fc = mx.symbol.FullyConnected(data, name='fc', num_hidden=10)
>>> loss = mx.symbol.SoftmaxOutput(fc, label, name='softmax')
>>> mod = mx.mod.Module(loss)
>>> mod.bind(data_shapes=[('data', (128,20))], label_shapes=[('softmax_label', (128,))])

Next we can initialize the weights with values sampled uniformly from [-1,1]:

>>> mod.init_params(mx.initializer.Uniform(scale=1.0))

Then we will train a model with standard SGD which decreases the learning rate by multiplying 0.9 for each 100 batches.

>>> lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
>>> mod.init_optimizer(
...     optimizer='sgd', optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))

Finally run mod.fit(...) to start training.

The mxnet.initializer package

The base class Initializer defines the default behaviors to initialize various parameters, such as set bias to 1, except for the weight. Other classes then defines how to initialize the weight.

Initializer The base class of an initializer.
Uniform Initializes weights with random values uniformly sampled from a given range.
Normal Initializes weights with random values sampled from a normal distribution with a mean of zero and standard deviation of sigma.
Load Initializes variables by loading data from file or dict.
Mixed Initialize parameters using multiple initializers.
Zero Initializes weights to zero.
One Initializes weights to one.
Constant Initializes the weights to a given value.
Orthogonal Initialize weight as orthogonal matrix.
Xavier Returns an initializer performing “Xavier” initialization for weights.
MSRAPrelu Initialize the weight according to a MSRA paper.
Bilinear Initialize weight for upsampling layers.
FusedRNN Initialize parameters for fused rnn layers.

The mxnet.optimizer package

The base class Optimizer accepts commonly shared arguments such as learning_rate and defines the interface. Each other class in this package implements one weight updating function.

Optimizer The base class inherited by all optimizers.
SGD The SGD optimizer with momentum and weight decay.
NAG Nesterov accelerated gradient.
RMSProp The RMSProp optimizer.
Adam The Adam optimizer.
AdaGrad AdaGrad optimizer.
AdaDelta The AdaDelta optimizer.
Adamax The AdaMax optimizer.
Nadam The Nesterov Adam optimizer.
DCASGD The DCASGD optimizer.
SGLD Stochastic Gradient Riemannian Langevin Dynamics.
Signum The Signum optimizer that takes the sign of gradient or momentum.
FTML The FTML optimizer.
LBSGD The Large Batch SGD optimizer with momentum and weight decay.
Ftrl The Ftrl optimizer.

The mxnet.lr_scheduler package

The base class LRScheduler defines the interface, while other classes implement various schemes to change the learning rate during training.

LRScheduler Base class of a learning rate scheduler.
FactorScheduler Reduce the learning rate by a factor for every n steps.
MultiFactorScheduler Reduce the learning rate by given a list of steps.

Implement a new algorithm

Most classes listed in this document are implemented in Python by using NDArray. So implementing new weight updating or initialization functions is straightforward.

For initializer, create a subclass of Initializer and define the _init_weight method. We can also change the default behaviors to initialize other parameters such as _init_bias. See initializer.py for examples.

For optimizer, create a subclass of Optimizer and implement two methods create_state and update. Also add @mx.optimizer.Optimizer.register before this class. See optimizer.py for examples.

For lr_scheduler, create a subclass of LRScheduler and then implement the __call__ method. See lr_scheduler.py for examples.

API Reference

Optimizer API of MXNet.

class mxnet.optimizer.AdaDelta(rho=0.9, epsilon=1e-05, **kwargs)[source]

The AdaDelta optimizer.

This class implements AdaDelta, an optimizer described in ADADELTA: An adaptive learning rate method, available at https://arxiv.org/abs/1212.5701.

This optimizer updates each weight by:

grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
acc_grad = rho * acc_grad + (1. - rho) * grad * grad
delta = sqrt(acc_delta + epsilon) / sqrt(acc_grad + epsilon) * grad
acc_delta = rho * acc_delta + (1. - rho) * delta * delta
weight -= (delta + wd * weight)

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • rho (float) – Decay rate for both squared gradients and delta.
  • epsilon (float) – Small value to avoid division by 0.
class mxnet.optimizer.AdaGrad(eps=1e-07, **kwargs)[source]

AdaGrad optimizer.

This class implements the AdaGrad optimizer described in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, and available at http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

This optimizer updates each weight by:

grad = clip(grad * rescale_grad, clip_gradient)
history += square(grad)
div = grad / sqrt(history + float_stable_eps)
weight += (div + weight * wd) * -lr

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:eps (float, optional) – Initial value of the history accumulator. Avoids division by 0.
class mxnet.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, lazy_update=True, **kwargs)[source]

The Adam optimizer.

This class implements the optimizer described in Adam: A Method for Stochastic Optimization, available at http://arxiv.org/abs/1412.6980.

If the storage types of grad is row_sparse, and lazy_update is True, lazy updates at step t are applied by:

for row in grad.indices:
    rescaled_grad[row] = clip(grad[row] * rescale_grad + wd * weight[row], clip_gradient)
    m[row] = beta1 * m[row] + (1 - beta1) * rescaled_grad[row]
    v[row] = beta2 * v[row] + (1 - beta2) * (rescaled_grad[row]**2)
    lr = learning_rate * sqrt(1 - beta1**t) / (1 - beta2**t)
    w[row] = w[row] - lr * m[row] / (sqrt(v[row]) + epsilon)

The lazy update only updates the mean and var for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.

Otherwise, standard updates at step t are applied by:

rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
m = beta1 * m + (1 - beta1) * rescaled_grad
v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
lr = learning_rate * sqrt(1 - beta1**t) / (1 - beta2**t)
w = w - lr * m / (sqrt(v) + epsilon)

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

For details of the update algorithm, see adam_update.

Parameters:
  • beta1 (float, optional) – Exponential decay rate for the first moment estimates.
  • beta2 (float, optional) – Exponential decay rate for the second moment estimates.
  • epsilon (float, optional) – Small value to avoid division by 0.
  • lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.
class mxnet.optimizer.Adamax(learning_rate=0.002, beta1=0.9, beta2=0.999, **kwargs)[source]

The AdaMax optimizer.

It is a variant of Adam based on the infinity norm available at http://arxiv.org/abs/1412.6980 Section 7.

The optimizer updates the weight by:

grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
m = beta1 * m_t + (1 - beta1) * grad
u = maximum(beta2 * u, abs(grad))
weight -= lr / (1 - beta1**t) * m / u

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • beta1 (float, optional) – Exponential decay rate for the first moment estimates.
  • beta2 (float, optional) – Exponential decay rate for the second moment estimates.
class mxnet.optimizer.DCASGD(momentum=0.0, lamda=0.04, **kwargs)[source]

The DCASGD optimizer.

This class implements the optimizer described in Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning, available at https://arxiv.org/abs/1609.08326.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • momentum (float, optional) – The momentum value.
  • lamda (float, optional) – Scale DC value.
class mxnet.optimizer.FTML(beta1=0.6, beta2=0.999, epsilon=1e-08, **kwargs)[source]

The FTML optimizer.

This class implements the optimizer described in FTML - Follow the Moving Leader in Deep Learning, available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.

Denote time step by t. The optimizer updates the weight by:

rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
v = beta2 * v + (1 - beta2) * square(rescaled_grad)
d_t = (1 - power(beta1, t)) / lr * square_root(v / (1 - power(beta2, t))) + epsilon)
z = beta1 * z + (1 - beta1) * rescaled_grad - (d_t - beta1 * d_(t-1)) * weight
weight = - z / d_t

For details of the update algorithm, see ftml_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • beta1 (float, optional) – 0 < beta1 < 1. Generally close to 0.5.
  • beta2 (float, optional) – 0 < beta2 < 1. Generally close to 1.
  • epsilon (float, optional) – Small value to avoid division by 0.
class mxnet.optimizer.Ftrl(lamda1=0.01, learning_rate=0.1, beta=1, **kwargs)[source]

The Ftrl optimizer.

Referenced from Ad Click Prediction: a View from the Trenches, available at http://dl.acm.org/citation.cfm?id=2488200.

eta :
\[\eta_{t,i} = \frac{learningrate}{\beta+\sqrt{\sum_{s=1}^tg_{s,i}^2}}\]

The optimizer updates the weight by:

rescaled_grad = clip(grad * rescale_grad, clip_gradient)
z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
n += rescaled_grad**2
w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)

If the storage types of weight, state and grad are all row_sparse, sparse updates are applied by:

for row in grad.indices:
    rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
    z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
    n[row] += rescaled_grad[row]**2
    w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)

The sparse update only updates the z and n for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.

For details of the update algorithm, see ftrl_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • lamda1 (float, optional) – L1 regularization coefficient.
  • learning_rate (float, optional) – The initial learning rate.
  • beta (float, optional) – Per-coordinate learning rate correlation parameter.
class mxnet.optimizer.LBSGD(momentum=0.0, multi_precision=False, warmup_strategy='linear', warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60, **kwargs)[source]

The Large Batch SGD optimizer with momentum and weight decay.

The optimizer updates the weight by:

state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
weight = weight - state

For details of the update algorithm see sgd_update and sgd_mom_update. In addition to the SGD updates the LBSGD optimizer uses the LARS, Layer-wise Adaptive Rate Scaling, algorithm to have a separate learning rate for each layer of the network, which leads to better stability over large batch sizes.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • momentum (float, optional) – The momentum value.
  • multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False: results in using the same precision as the weights (default), True: makes internal 32-bit copy of the weights and applies gradients in 32-bit precision even if actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
  • warmup_strategy (string ('linear', 'power2', 'sqrt'. , 'lars' default : 'linear')) –
  • warmup_epochs (unsigned, default: 5) –
  • batch_scale (unsigned, default: 1 (same as batch size*numworkers)) –
  • updates_per_epoch (updates_per_epoch (default: 32, Default might not reflect true number batches per epoch. Used for warmup.)) –
  • begin_epoch (unsigned, default 0, starting epoch.) –
class mxnet.optimizer.NAG(momentum=0.0, **kwargs)[source]

Nesterov accelerated gradient.

This optimizer updates each weight by:

state = momentum * state + grad + wd * weight
weight = weight - (lr * (grad + momentum * state))
Parameters:
  • momentum (float, optional) – The momentum value.
  • multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False: results in using the same precision as the weights (default), True: makes internal 32-bit copy of the weights and applies gradients in 32-bit precision even if actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
class mxnet.optimizer.Nadam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, schedule_decay=0.004, **kwargs)[source]

The Nesterov Adam optimizer.

Much like Adam is essentially RMSprop with momentum, Nadam is Adam RMSprop with Nesterov momentum available at http://cs229.stanford.edu/proj2015/054_report.pdf.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • beta1 (float, optional) – Exponential decay rate for the first moment estimates.
  • beta2 (float, optional) – Exponential decay rate for the second moment estimates.
  • epsilon (float, optional) – Small value to avoid division by 0.
  • schedule_decay (float, optional) – Exponential decay rate for the momentum schedule
class mxnet.optimizer.Optimizer(rescale_grad=1.0, param_idx2name=None, wd=0.0, clip_gradient=None, learning_rate=0.01, lr_scheduler=None, sym=None, begin_num_update=0, multi_precision=False, param_dict=None)[source]

The base class inherited by all optimizers.

Parameters:
  • rescale_grad (float, optional, default 1.0) – Multiply the gradient with rescale_grad before updating. Often choose to be 1.0/batch_size.
  • param_idx2name (dict from int to string, optional, default None) – A dictionary that maps int index to string name.
  • clip_gradient (float, optional, default None) – Clip the gradient by projecting onto the box [-clip_gradient, clip_gradient].
  • learning_rate (float) – The initial learning rate.
  • lr_scheduler (LRScheduler, optional, default None) – The learning rate scheduler.
  • wd (float, optional, default 0.0) – The weight decay (or L2 regularization) coefficient. Modifies objective by adding a penalty for having large weights.
  • sym (Symbol, optional, default None) – The Symbol this optimizer is applying to.
  • begin_num_update (int, optional, default 0) – The initial number of updates.
  • multi_precision (bool, optional, default False) – Flag to control the internal precision of the optimizer. False: results in using the same precision as the weights (default), True: makes internal 32-bit copy of the weights and applies gradients in 32-bit precision even if actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
  • param_dict (dict of int -> gluon.Parameter, default None) – Dictionary of parameter index to gluon.Parameter, used to lookup parameter attributes such as lr_mult, wd_mult, etc. param_dict shall not be deep copied.
  • Properties
  • ----------
  • learning_rate – The current learning rate of the optimizer. Given an Optimizer object optimizer, its learning rate can be accessed as optimizer.learning_rate.
static create_optimizer(name, **kwargs)[source]

Instantiates an optimizer with a given name and kwargs.

Note

We can use the alias create for Optimizer.create_optimizer.

Parameters:
  • name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
  • kwargs (dict) – Parameters for the optimizer.
Returns:

An instantiated optimizer.

Return type:

Optimizer

Examples

>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)

>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)

create_state(index, weight)[source]

Creates auxiliary state for a given weight.

Some optimizers require additional states, e.g. as momentum, in addition to gradients in order to update weights. This function creates state for a given weight which will be used in update. This function is called only once for each weight.

Parameters:
  • index (int) – An unique index to identify the weight.
  • weight (NDArray) – The weight.
Returns:

state – The state associated with the weight.

Return type:

any obj

create_state_multi_precision(index, weight)[source]

Creates auxiliary state for a given weight, including FP32 high precision copy if original weight is FP16.

This method is provided to perform automatic mixed precision training for optimizers that do not support it themselves.

Parameters:
  • index (int) – An unique index to identify the weight.
  • weight (NDArray) – The weight.
Returns:

state – The state associated with the weight.

Return type:

any obj

static register(klass)[source]

Registers a new optimizer.

Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.

Examples

>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))

set_learning_rate(lr)[source]

Sets a new learning rate of the optimizer.

Parameters:lr (float) – The new learning rate of the optimizer.
set_lr_mult(args_lr_mult)[source]

Sets an individual learning rate multiplier for each parameter.

If you specify a learning rate multiplier for a parameter, then the learning rate for the parameter will be set as the product of the global learning rate self.lr and its multiplier.

Note

The default learning rate multiplier of a Variable can be set with lr_mult argument in the constructor.

Parameters:args_lr_mult (dict of str/int to float) –

For each of its key-value entries, the learning rate multipler for the parameter specified in the key will be set as the given value.

You can specify the parameter with either its name or its index. If you use the name, you should pass sym in the constructor, and the name you specified in the key of args_lr_mult should match the name of the parameter in sym. If you use the index, it should correspond to the index of the parameter used in the update method.

Specifying a parameter by its index is only supported for backward compatibility, and we recommend to use the name instead.

set_lr_scale(args_lrscale)[source]

[DEPRECATED] Sets lr scale. Use set_lr_mult instead.

set_wd_mult(args_wd_mult)[source]

Sets an individual weight decay multiplier for each parameter.

By default, if param_idx2name was provided in the constructor, the weight decay multipler is set as 0 for all parameters whose name don’t end with _weight or _gamma.

Note

The default weight decay multiplier for a Variable can be set with its wd_mult argument in the constructor.

Parameters:args_wd_mult (dict of string/int to float) –

For each of its key-value entries, the weight decay multipler for the parameter specified in the key will be set as the given value.

You can specify the parameter with either its name or its index. If you use the name, you should pass sym in the constructor, and the name you specified in the key of args_lr_mult should match the name of the parameter in sym. If you use the index, it should correspond to the index of the parameter used in the update method.

Specifying a parameter by its index is only supported for backward compatibility, and we recommend to use the name instead.

update(index, weight, grad, state)[source]

Updates the given parameter using the corresponding gradient and state.

Parameters:
  • index (int) – The unique index of the parameter into the individual learning rates and weight decays. Learning rates and weight decay may be set via set_lr_mult() and set_wd_mult(), respectively.
  • weight (NDArray) – The parameter to be updated.
  • grad (NDArray) – The gradient of the objective with respect to this parameter.
  • state (any obj) – The state returned by create_state().
update_multi_precision(index, weight, grad, state)[source]

Updates the given parameter using the corresponding gradient and state. Mixed precision version.

Parameters:
  • index (int) – The unique index of the parameter into the individual learning rates and weight decays. Learning rates and weight decay may be set via set_lr_mult() and set_wd_mult(), respectively.
  • weight (NDArray) – The parameter to be updated.
  • grad (NDArray) – The gradient of the objective with respect to this parameter.
  • state (any obj) – The state returned by create_state().
class mxnet.optimizer.RMSProp(learning_rate=0.001, gamma1=0.9, gamma2=0.9, epsilon=1e-08, centered=False, clip_weights=None, **kwargs)[source]

The RMSProp optimizer.

Two versions of RMSProp are implemented:

If centered=False, we follow http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by Tieleman & Hinton, 2012. For details of the update algorithm see rmsprop_update.

If centered=True, we follow http://arxiv.org/pdf/1308.0850v5.pdf (38)-(45) by Alex Graves, 2013. For details of the update algorithm see rmspropalex_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • gamma1 (float, optional) – A decay factor of moving average over past squared gradient.
  • gamma2 (float, optional) – A “momentum” factor. Only used if centered`=``True`.
  • epsilon (float, optional) – Small value to avoid division by 0.
  • centered (bool, optional) –

    Flag to control which version of RMSProp to use.:

    True: will use Graves's version of `RMSProp`,
    False: will use Tieleman & Hinton's version of `RMSProp`.
    
  • clip_weights (float, optional) – Clips weights into range [-clip_weights, clip_weights].
class mxnet.optimizer.SGD(momentum=0.0, lazy_update=True, **kwargs)[source]

The SGD optimizer with momentum and weight decay.

If the storage types of grad is row_sparse and lazy_update is True, lazy updates are applied by:

for row in grad.indices:
    rescaled_grad[row] = lr * (rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row])
    state[row] = momentum[row] * state[row] + rescaled_grad[row]
    weight[row] = weight[row] - state[row]

The sparse update only updates the momentum for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.

In the case when update_on_kvstore is set to False (either globally via MXNET_UPDATE_ON_KVSTORE=0 environment variable or as a parameter in Trainer) SGD optimizer can perform aggregated update of parameters, which may lead to improved performance. The aggregation size is controlled by MXNET_OPTIMIZER_AGGREGATION_SIZE environment variable and defaults to 4.

Otherwise, standard updates are applied by:

rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
state = momentum * state + rescaled_grad
weight = weight - state

For details of the update algorithm see sgd_update and sgd_mom_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • momentum (float, optional) – The momentum value.
  • lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.
  • multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False: results in using the same precision as the weights (default), True: makes internal 32-bit copy of the weights and applies gradients in 32-bit precision even if actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
class mxnet.optimizer.SGLD(**kwargs)[source]

Stochastic Gradient Riemannian Langevin Dynamics.

This class implements the optimizer described in the paper Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex, available at https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.

class mxnet.optimizer.Signum(learning_rate=0.01, momentum=0.9, wd_lh=0.0, **kwargs)[source]

The Signum optimizer that takes the sign of gradient or momentum.

The optimizer updates the weight by:

rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + (1-momentum)*rescaled_grad
weight = (1 - lr * wd_lh) * weight - lr * sign(state)

References

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli & Anima Anandkumar. (2018). signSGD: Compressed Optimisation for Non-Convex Problems. In ICML‘18.

See: https://arxiv.org/abs/1802.04434

For details of the update algorithm see signsgd_update and signum_update.

This optimizer accepts the following parameters in addition to those accepted by Optimizer.

Parameters:
  • momentum (float, optional) – The momentum value.
  • wd_lh (float, optional) – The amount of decoupled weight decay regularization, see details in the original paper at:https://arxiv.org/abs/1711.05101
class mxnet.optimizer.Test(**kwargs)[source]

The Test optimizer

create_state(index, weight)[source]

Creates a state to duplicate weight.

update(index, weight, grad, state)[source]

Performs w += rescale_grad * grad.

class mxnet.optimizer.Updater(optimizer)[source]

Updater for kvstore.

get_states(dump_optimizer=False)[source]

Gets updater states.

Parameters:dump_optimizer (bool, default False) – Whether to also save the optimizer itself. This would also save optimizer information such as learning rate and weight decay schedules.
set_states(states)[source]

Sets updater states.

sync_state_context(state, context)[source]

sync state context.

class mxnet.optimizer.ccSGD(*args, **kwargs)[source]

[DEPRECATED] Same as SGD. Left here for backward compatibility.

mxnet.optimizer.create(name, **kwargs)

Instantiates an optimizer with a given name and kwargs.

Note

We can use the alias create for Optimizer.create_optimizer.

Parameters:
  • name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
  • kwargs (dict) – Parameters for the optimizer.
Returns:

An instantiated optimizer.

Return type:

Optimizer

Examples

>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)

>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)

mxnet.optimizer.get_updater(optimizer)[source]

Returns a closure of the updater needed for kvstore.

Parameters:optimizer (Optimizer) – The optimizer.
Returns:updater – The closure of the updater.
Return type:function
mxnet.optimizer.register(klass)

Registers a new optimizer.

Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.

Examples

>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))

Scheduling learning rate.

class mxnet.lr_scheduler.LRScheduler(base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')[source]

Base class of a learning rate scheduler.

A scheduler returns a new learning rate based on the number of updates that have been performed.

Parameters:
  • base_lr (float, optional) – The initial learning rate.
  • warmup_steps (int) – number of warmup steps used before this scheduler starts decay
  • warmup_begin_lr (float) – if using warmup, the learning rate from which it starts warming up
  • warmup_mode (string) – warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps
class mxnet.lr_scheduler.FactorScheduler(step, factor=1, stop_factor_lr=1e-08, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')[source]

Reduce the learning rate by a factor for every n steps.

It returns a new learning rate by:

base_lr * pow(factor, floor(num_update/step))
Parameters:
  • step (int) – Changes the learning rate for every n updates.
  • factor (float, optional) – The factor to change the learning rate.
  • stop_factor_lr (float, optional) – Stop updating the learning rate if it is less than this value.
class mxnet.lr_scheduler.MultiFactorScheduler(step, factor=1, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')[source]

Reduce the learning rate by given a list of steps.

Assume there exists k such that:

step[k] <= num_update and num_update < step[k+1]

Then calculate the new learning rate by:

base_lr * pow(factor, k+1)
Parameters:
  • step (list of int) – The list of steps to schedule a change
  • factor (float) – The factor to change the learning rate.
  • warmup_steps (int) – number of warmup steps used before this scheduler starts decay
  • warmup_begin_lr (float) – if using warmup, the learning rate from which it starts warming up
  • warmup_mode (string) – warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps
class mxnet.lr_scheduler.PolyScheduler(max_update, base_lr=0.01, pwr=2, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')[source]

Reduce the learning rate according to a polynomial of given power.

Calculate the new learning rate, after warmup if any, by:

final_lr + (start_lr - final_lr) * (1-nup/max_nup)^pwr
if nup < max_nup, 0 otherwise.
Parameters:
  • max_update (int) – maximum number of updates before the decay reaches final learning rate.
  • base_lr (float) – base learning rate to start from
  • pwr (int) – power of the decay term as a function of the current number of updates.
  • final_lr (float) – final learning rate after all steps
  • warmup_steps (int) – number of warmup steps used before this scheduler starts decay
  • warmup_begin_lr (float) – if using warmup, the learning rate from which it starts warming up
  • warmup_mode (string) – warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps
class mxnet.lr_scheduler.CosineScheduler(max_update, base_lr=0.01, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')[source]

Reduce the learning rate according to a cosine function

Calculate the new learning rate by:

final_lr + (start_lr - final_lr) * (1+cos(pi * nup/max_nup))/2
if nup < max_nup, 0 otherwise.
Parameters:
  • max_update (int) – maximum number of updates before the decay reaches 0
  • base_lr (float) – base learning rate
  • final_lr (float) – final learning rate after all steps
  • warmup_steps (int) – number of warmup steps used before this scheduler starts decay
  • warmup_begin_lr (float) – if using warmup, the learning rate from which it starts warming up
  • warmup_mode (string) – warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps

Weight initializer.

class mxnet.initializer.InitDesc[source]

Descriptor for the initialization pattern.

Parameters:
  • name (str) – Name of variable.
  • attrs (dict of str to str) – Attributes of this variable taken from Symbol.attr_dict.
  • global_init (Initializer) – Global initializer to fallback to.
class mxnet.initializer.Initializer(**kwargs)[source]

The base class of an initializer.

set_verbosity(verbose=False, print_func=None)[source]

Switch on/off verbose mode

Parameters:
  • verbose (bool) – switch on/off verbose mode
  • print_func (function) – A function that computes statistics of initialized arrays. Takes an NDArray and returns an str. Defaults to mean absolute value str((abs(x)/size(x)).asscalar()).
dumps()[source]

Saves the initializer to string

Returns:JSON formatted string that describes the initializer.
Return type:str

Examples

>>> # Create initializer and retrieve its parameters
...
>>> init = mx.init.Normal(0.5)
>>> init.dumps()
'["normal", {"sigma": 0.5}]'
>>> init = mx.init.Xavier(factor_type="in", magnitude=2.34)
>>> init.dumps()
'["xavier", {"rnd_type": "uniform", "magnitude": 2.34, "factor_type": "in"}]'
mxnet.initializer.register(klass)[source]

Registers a custom initializer.

Custom initializers can be created by extending mx.init.Initializer and implementing the required functions like _init_weight and _init_bias. The created initializer must be registered using mx.init.register before it can be called by name.

Parameters:klass (class) – A subclass of mx.init.Initializer that needs to be registered as a custom initializer.

Example

>>> # Create and register a custom initializer that
... # initializes weights to 0.1 and biases to 1.
...
>>> @mx.init.register
... @alias('myinit')
... class CustomInit(mx.init.Initializer):
...   def __init__(self):
...     super(CustomInit, self).__init__()
...   def _init_weight(self, _, arr):
...     arr[:] = 0.1
...   def _init_bias(self, _, arr):
...     arr[:] = 1
...
>>> # Module is an instance of 'mxnet.module.Module'
...
>>> module.init_params("custominit")
>>> # module.init_params("myinit")
>>> # module.init_params(CustomInit())
class mxnet.initializer.Load(param, default_init=None, verbose=False)[source]

Initializes variables by loading data from file or dict.

Note Load will drop arg: or aux: from name and initialize the variables that match with the prefix dropped.

Parameters:
  • param (str or dict of str->`NDArray`) – Parameter file or dict mapping name to NDArray.
  • default_init (Initializer) – Default initializer when name is not found in param.
  • verbose (bool) – Flag for enabling logging of source when initializing.
class mxnet.initializer.Mixed(patterns, initializers)[source]

Initialize parameters using multiple initializers.

Parameters:
  • patterns (list of str) – List of regular expressions matching parameter names.
  • initializers (list of Initializer) – List of initializers corresponding to patterns.

Example

>>> # Given 'module', an instance of 'mxnet.module.Module', initialize biases to zero
... # and every other parameter to random values with uniform distribution.
...
>>> init = mx.initializer.Mixed(['bias', '.*'], [mx.init.Zero(), mx.init.Uniform(0.1)])
>>> module.init_params(init)
>>>
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected1_weight
[[ 0.0097627   0.01856892  0.04303787]]
fullyconnected1_bias
[ 0.]
class mxnet.initializer.Zero[source]

Initializes weights to zero.

Example

>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights to zero.
...
>>> init = mx.initializer.Zero()
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 0.  0.  0.]]
class mxnet.initializer.One[source]

Initializes weights to one.

Example

>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights to one.
...
>>> init = mx.initializer.One()
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 1.  1.  1.]]
class mxnet.initializer.Constant(value)[source]

Initializes the weights to a given value. The value passed in can be a scalar or a NDarray that matches the shape of the parameter to be set.

Parameters:value (float, NDArray) – Value to set.
class mxnet.initializer.Uniform(scale=0.07)[source]

Initializes weights with random values uniformly sampled from a given range.

Parameters:scale (float, optional) – The bound on the range of the generated random values. Values are generated from the range [-scale, scale]. Default scale is 0.07.

Example

>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights
>>> # to random values uniformly sampled between -0.1 and 0.1.
...
>>> init = mx.init.Uniform(0.1)
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 0.01360891 -0.02144304  0.08511933]]
class mxnet.initializer.Normal(sigma=0.01)[source]

Initializes weights with random values sampled from a normal distribution with a mean of zero and standard deviation of sigma.

Parameters:sigma (float, optional) – Standard deviation of the normal distribution. Default standard deviation is 0.01.

Example

>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights
>>> # to random values sampled from a normal distribution.
...
>>> init = mx.init.Normal(0.5)
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[-0.3214761  -0.12660924  0.53789419]]
class mxnet.initializer.Orthogonal(scale=1.414, rand_type='uniform')[source]

Initialize weight as orthogonal matrix.

This initializer implements Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, available at https://arxiv.org/abs/1312.6120.

Parameters:
  • scale (float optional) – Scaling factor of weight.
  • rand_type (string optional) – Use “uniform” or “normal” random number to initialize weight.
class mxnet.initializer.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3)[source]

Returns an initializer performing “Xavier” initialization for weights.

This initializer is designed to keep the scale of gradients roughly the same in all layers.

By default, rnd_type is 'uniform' and factor_type is 'avg', the initializer fills the weights with random numbers in the range of \([-c, c]\), where \(c = \sqrt{\frac{3.}{0.5 * (n_{in} + n_{out})}}\). \(n_{in}\) is the number of neurons feeding into weights, and \(n_{out}\) is the number of neurons the result is fed to.

If rnd_type is 'uniform' and factor_type is 'in', the \(c = \sqrt{\frac{3.}{n_{in}}}\). Similarly when factor_type is 'out', the \(c = \sqrt{\frac{3.}{n_{out}}}\).

If rnd_type is 'gaussian' and factor_type is 'avg', the initializer fills the weights with numbers from normal distribution with a standard deviation of \(\sqrt{\frac{3.}{0.5 * (n_{in} + n_{out})}}\).

Parameters:
  • rnd_type (str, optional) – Random generator type, can be 'gaussian' or 'uniform'.
  • factor_type (str, optional) – Can be 'avg', 'in', or 'out'.
  • magnitude (float, optional) – Scale of random number.
class mxnet.initializer.MSRAPrelu(factor_type='avg', slope=0.25)[source]

Initialize the weight according to a MSRA paper.

This initializer implements Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, available at https://arxiv.org/abs/1502.01852.

This initializer is proposed for initialization related to ReLu activation, it maked some changes on top of Xavier method.

Parameters:
  • factor_type (str, optional) – Can be 'avg', 'in', or 'out'.
  • slope (float, optional) – initial slope of any PReLU (or similar) nonlinearities.
class mxnet.initializer.Bilinear[source]

Initialize weight for upsampling layers.

class mxnet.initializer.LSTMBias(forget_bias=1.0)[source]

Initialize all biases of an LSTMCell to 0.0 except for the forget gate whose bias is set to custom value.

Parameters:forget_bias (float, default 1.0) – bias for the forget gate. Jozefowicz et al. 2015 recommends setting this to 1.0.
class mxnet.initializer.FusedRNN(init, num_hidden, num_layers, mode, bidirectional=False, forget_bias=1.0)[source]

Initialize parameters for fused rnn layers.

Parameters:
  • init (Initializer) – initializer applied to unpacked weights. Fall back to global initializer if None.
  • num_hidden (int) – should be the same with arguments passed to FusedRNNCell.
  • num_layers (int) – should be the same with arguments passed to FusedRNNCell.
  • mode (str) – should be the same with arguments passed to FusedRNNCell.
  • bidirectional (bool) – should be the same with arguments passed to FusedRNNCell.
  • forget_bias (float) – should be the same with arguments passed to FusedRNNCell.