Optimization: initialize and update weights
Overview
This document summarizes the APIs used to initialize and update the model weights during training, and explains how to develop a new optimization algorithm in MXNet.
mxnet.initializer | Weight initializer.
mxnet.optimizer | Optimizer API of MXNet.
mxnet.lr_scheduler | Scheduling learning rate.
Assume there is a pre-defined Symbol and a Module has been created for it:
>>> import mxnet as mx
>>> data = mx.symbol.Variable('data')
>>> label = mx.symbol.Variable('softmax_label')
>>> fc = mx.symbol.FullyConnected(data, name='fc', num_hidden=10)
>>> loss = mx.symbol.SoftmaxOutput(fc, label, name='softmax')
>>> mod = mx.mod.Module(loss)
>>> mod.bind(data_shapes=[('data', (128,20))], label_shapes=[('softmax_label', (128,))])
Next, we can initialize the weights with values sampled uniformly from [-1, 1]:
>>> mod.init_params(mx.initializer.Uniform(scale=1.0))
Then we train the model with standard SGD, multiplying the learning rate by 0.9 every 100 batches:
>>> lr_sch = mx.lr_scheduler.FactorScheduler(step=100, factor=0.9)
>>> mod.init_optimizer(
... optimizer='sgd', optimizer_params=(('learning_rate', 0.1), ('lr_scheduler', lr_sch)))
Finally, run mod.fit(...) to start training.
The mxnet.initializer package
The base class Initializer defines the default behaviors for initializing every kind of parameter except the weight, such as setting the bias to 0. Other classes then define how to initialize the weight.
Initializer | The base class of an initializer.
Uniform | Initializes weights with random values uniformly sampled from a given range.
Normal | Initializes weights with random values sampled from a normal distribution with a mean of zero and standard deviation of sigma.
Load | Initializes variables by loading data from file or dict.
Mixed | Initialize parameters using multiple initializers.
Zero | Initializes weights to zero.
One | Initializes weights to one.
Constant | Initializes the weights to a given value.
Orthogonal | Initialize weight as orthogonal matrix.
Xavier | Returns an initializer performing “Xavier” initialization for weights.
MSRAPrelu | Initialize the weight according to a MSRA paper.
Bilinear | Initialize weight for upsampling layers.
FusedRNN | Initialize parameters for fused rnn layers.
The mxnet.optimizer package
The base class Optimizer accepts commonly shared arguments such as learning_rate and defines the interface. Each other class in this package implements one weight updating function.
Optimizer | The base class inherited by all optimizers.
SGD | The SGD optimizer with momentum and weight decay.
NAG | Nesterov accelerated SGD.
RMSProp | The RMSProp optimizer.
Adam | The Adam optimizer.
AdaGrad | AdaGrad optimizer.
AdaDelta | The AdaDelta optimizer.
Adamax | The AdaMax optimizer.
Nadam | The Nesterov Adam optimizer.
DCASGD | The DCASGD optimizer.
SGLD | Stochastic Gradient Riemannian Langevin Dynamics.
Signum | The Signum optimizer that takes the sign of gradient or momentum.
FTML | The FTML optimizer.
LBSGD | The Large Batch SGD optimizer with momentum and weight decay.
Ftrl | The Ftrl optimizer.
The mxnet.lr_scheduler package
The base class LRScheduler defines the interface, while other classes implement various schemes to change the learning rate during training.
LRScheduler | Base class of a learning rate scheduler.
FactorScheduler | Reduce the learning rate by a factor for every n steps.
MultiFactorScheduler | Reduce the learning rate at each step in a given list of steps.
Implement a new algorithm
Most classes listed in this document are implemented in Python by using NDArray, so implementing new weight updating or initialization functions is straightforward.
For initializer, create a subclass of Initializer and define the _init_weight method. We can also change the default behaviors to initialize other parameters, such as _init_bias. See initializer.py for examples.
For optimizer, create a subclass of Optimizer and implement two methods, create_state and update. Also add @mx.optimizer.Optimizer.register before this class. See optimizer.py for examples.
For lr_scheduler, create a subclass of LRScheduler and then implement the __call__ method. See lr_scheduler.py for examples.
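As a rough illustration of the lr_scheduler case, the sketch below shows only the __call__ contract, written in plain Python so that it runs without MXNet. The class name InverseTimeDecay, the decay_rate argument, and the decay formula are hypothetical and chosen purely for illustration; in MXNet the class would subclass mx.lr_scheduler.LRScheduler instead of standing alone.

```python
class InverseTimeDecay:
    """Hypothetical scheduler: lr = base_lr / (1 + decay_rate * num_update)."""

    def __init__(self, base_lr=0.01, decay_rate=0.001):
        self.base_lr = base_lr
        self.decay_rate = decay_rate

    def __call__(self, num_update):
        # num_update is the number of weight updates performed so far.
        return self.base_lr / (1.0 + self.decay_rate * num_update)


sched = InverseTimeDecay(base_lr=0.1, decay_rate=0.01)
print(sched(0))    # learning rate before any update
print(sched(100))  # learning rate after 100 updates
```

The scheduler is a callable that maps an update count to a learning rate, so any state it needs lives on the instance.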
API Reference
Optimizer API of MXNet.
class mxnet.optimizer.AdaDelta(rho=0.9, epsilon=1e-05, **kwargs)
The AdaDelta optimizer.
This class implements AdaDelta, an optimizer described in ADADELTA: An adaptive learning rate method, available at https://arxiv.org/abs/1212.5701.
This optimizer updates each weight by:
    grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
    acc_grad = rho * acc_grad + (1. - rho) * grad * grad
    delta = sqrt(acc_delta + epsilon) / sqrt(acc_grad + epsilon) * grad
    acc_delta = rho * acc_delta + (1. - rho) * delta * delta
    weight -= (delta + wd * weight)
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- rho (float) – Decay rate for both squared gradients and delta.
- epsilon (float) – Small value to avoid division by 0.
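The AdaDelta rule above can be traced on a single scalar weight. This is a plain-Python sketch of the documented formula, not the library's NDArray implementation; the helper names clip and adadelta_step are made up for the example.

```python
import math


def clip(x, bound):
    """Clamp x to [-bound, bound]; bound=None disables clipping."""
    if bound is None:
        return x
    return max(-bound, min(bound, x))


def adadelta_step(weight, grad, acc_grad, acc_delta, rho=0.9, epsilon=1e-5,
                  wd=0.0, rescale_grad=1.0, clip_gradient=None):
    """One AdaDelta step for a scalar weight, following the documented update."""
    grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
    acc_grad = rho * acc_grad + (1.0 - rho) * grad * grad
    delta = math.sqrt(acc_delta + epsilon) / math.sqrt(acc_grad + epsilon) * grad
    acc_delta = rho * acc_delta + (1.0 - rho) * delta * delta
    weight -= delta + wd * weight
    return weight, acc_grad, acc_delta


w, g2, d2 = 1.0, 0.0, 0.0
for _ in range(3):
    # Minimize f(w) = w**2, whose gradient is 2 * w.
    w, g2, d2 = adadelta_step(w, 2.0 * w, g2, d2)
print(w)
```

No learning rate appears in the step: its size comes from the ratio of the two running averages.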
class mxnet.optimizer.AdaGrad(eps=1e-07, **kwargs)
AdaGrad optimizer.
This class implements the AdaGrad optimizer described in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, and available at http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.
This optimizer updates each weight by:
    grad = clip(grad * rescale_grad, clip_gradient)
    history += square(grad)
    div = grad / sqrt(history + float_stable_eps)
    weight += (div + weight * wd) * -lr
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters: eps (float, optional) – Initial value of the history accumulator. Avoids division by 0.
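To see how the accumulated history shrinks the effective step, here is a scalar sketch of the AdaGrad formula above. It is an illustration under simplifying assumptions (gradient clipping omitted, eps used in place of float_stable_eps), and adagrad_step is a made-up helper name.

```python
import math


def adagrad_step(weight, grad, history, lr=0.01, wd=0.0, rescale_grad=1.0, eps=1e-7):
    """One AdaGrad step for a scalar weight; clipping is omitted in this sketch."""
    grad = grad * rescale_grad
    history += grad * grad
    div = grad / math.sqrt(history + eps)
    weight += (div + weight * wd) * -lr
    return weight, history


w, h = 1.0, 0.0
steps = []
for _ in range(3):
    new_w, h = adagrad_step(w, 2.0, h)
    steps.append(w - new_w)  # size of the step just taken
    w = new_w
print(steps)  # with a constant gradient, each step is smaller than the last
```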
class mxnet.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, lazy_update=True, **kwargs)
The Adam optimizer.
This class implements the optimizer described in Adam: A Method for Stochastic Optimization, available at http://arxiv.org/abs/1412.6980.
If the storage type of grad is row_sparse and lazy_update is True, lazy updates are applied by:
    for row in grad.indices:
        rescaled_grad[row] = clip(grad[row] * rescale_grad + wd * weight[row], clip_gradient)
        m[row] = beta1 * m[row] + (1 - beta1) * rescaled_grad[row]
        v[row] = beta2 * v[row] + (1 - beta2) * (rescaled_grad[row]**2)
        w[row] = w[row] - learning_rate * m[row] / (sqrt(v[row]) + epsilon)
The lazy update only updates the mean and var for the weights whose row_sparse gradient indices appear in the current batch, rather than updating them for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.
Otherwise, standard updates are applied by:
    rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
    m = beta1 * m + (1 - beta1) * rescaled_grad
    v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
    w = w - learning_rate * m / (sqrt(v) + epsilon)
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
For details of the update algorithm, see adam_update.
Parameters:
- beta1 (float, optional) – Exponential decay rate for the first moment estimates.
- beta2 (float, optional) – Exponential decay rate for the second moment estimates.
- epsilon (float, optional) – Small value to avoid division by 0.
- lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.
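The difference in semantics between the lazy and standard Adam updates is easiest to see on a toy example. The sketch below models each row as a single float and assumes wd=0, rescale_grad=1, and no clipping; adam_row, standard_update, and lazy_update are illustrative names, not MXNet functions.

```python
import math


def adam_row(w, g, m, v, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply the documented Adam update to one row (modelled here as a scalar)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    w = w - lr * m / (math.sqrt(v) + epsilon)
    return w, m, v


def standard_update(weights, dense_grad, ms, vs):
    # Every row is visited, so m and v decay even where the gradient is zero.
    for row in range(len(weights)):
        weights[row], ms[row], vs[row] = adam_row(
            weights[row], dense_grad[row], ms[row], vs[row])


def lazy_update(weights, sparse_grad, ms, vs):
    # sparse_grad maps row index -> gradient; rows that are absent are skipped.
    for row, g in sparse_grad.items():
        weights[row], ms[row], vs[row] = adam_row(weights[row], g, ms[row], vs[row])


dense_w, dense_m, dense_v = [1.0] * 3, [0.5] * 3, [0.25] * 3
lazy_w, lazy_m, lazy_v = [1.0] * 3, [0.5] * 3, [0.25] * 3
standard_update(dense_w, [0.0, 1.0, 0.0], dense_m, dense_v)
lazy_update(lazy_w, {1: 1.0}, lazy_m, lazy_v)
print(dense_w[0], lazy_w[0])    # rows without a gradient end up different
print(dense_w[1] == lazy_w[1])  # the row that received a gradient agrees
```

With nonzero moment estimates, the standard update still moves rows whose current gradient is zero, while the lazy update leaves them untouched. This is why the two variants can produce different results.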
class mxnet.optimizer.Adamax(learning_rate=0.002, beta1=0.9, beta2=0.999, **kwargs)
The AdaMax optimizer.
It is a variant of Adam based on the infinity norm, described in Section 7 of http://arxiv.org/abs/1412.6980.
The optimizer updates the weight by:
    grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
    m = beta1 * m + (1 - beta1) * grad
    u = maximum(beta2 * u, abs(grad))
    weight -= lr / (1 - beta1**t) * m / u
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- beta1 (float, optional) – Exponential decay rate for the first moment estimates.
- beta2 (float, optional) – Exponential decay rate for the second moment estimates.
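A scalar sketch of the AdaMax rule above follows. It assumes wd=0, rescale_grad=1, and no clipping, counts the step t from 1, and uses a made-up helper name adamax_step. It would divide by zero if the very first gradient were exactly 0, a case this sketch does not handle.

```python
def adamax_step(weight, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax step for a scalar weight; t is the 1-based update count."""
    m = beta1 * m + (1 - beta1) * grad
    u = max(beta2 * u, abs(grad))           # infinity-norm style running maximum
    weight -= lr / (1 - beta1 ** t) * m / u
    return weight, m, u


w, m, u = 1.0, 0.0, 0.0
w, m, u = adamax_step(w, 2.0, m, u, t=1)
print(w)  # the first step has magnitude lr whatever the gradient scale
```

The bias correction 1 - beta1**t cancels the (1 - beta1) factor on the first step, and u equals abs(grad), so the first update has size lr.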
class mxnet.optimizer.DCASGD(momentum=0.0, lamda=0.04, **kwargs)
The DCASGD optimizer.
This class implements the optimizer described in Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning, available at https://arxiv.org/abs/1609.08326.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- momentum (float, optional) – The momentum value.
- lamda (float, optional) – Scale DC value.
class mxnet.optimizer.FTML(beta1=0.6, beta2=0.999, epsilon=1e-08, **kwargs)
The FTML optimizer.
This class implements the optimizer described in FTML - Follow the Moving Leader in Deep Learning, available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.
Denote time step by t. The optimizer updates the weight by:
    rescaled_grad = clip(grad * rescale_grad + wd * weight, clip_gradient)
    v = beta2 * v + (1 - beta2) * square(rescaled_grad)
    d_t = (1 - power(beta1, t)) / lr * (square_root(v / (1 - power(beta2, t))) + epsilon)
    z = beta1 * z + (1 - beta1) * rescaled_grad - (d_t - beta1 * d_(t-1)) * weight
    weight = - z / d_t
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- beta1 (float, optional) – 0 < beta1 < 1. Generally close to 0.5.
- beta2 (float, optional) – 0 < beta2 < 1. Generally close to 1.
- epsilon (float, optional) – Small value to avoid division by 0.
class mxnet.optimizer.Ftrl(lamda1=0.01, learning_rate=0.1, beta=1, **kwargs)
The Ftrl optimizer.
Referenced from Ad Click Prediction: a View from the Trenches, available at http://dl.acm.org/citation.cfm?id=2488200.
eta:
\[\eta_{t,i} = \frac{learningrate}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^2}}\]
The optimizer updates the weight by:
    rescaled_grad = clip(grad * rescale_grad, clip_gradient)
    z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
    n += rescaled_grad**2
    w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)
If the storage types of weight, state and grad are all row_sparse, sparse updates are applied by:
    for row in grad.indices:
        rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
        z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
        n[row] += rescaled_grad[row]**2
        w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)
The sparse update only updates the z and n for the weights whose row_sparse gradient indices appear in the current batch, rather than updating them for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.
For details of the update algorithm, see ftrl_update.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- lamda1 (float, optional) – L1 regularization coefficient.
- learning_rate (float, optional) – The initial learning rate.
- beta (float, optional) – Per-coordinate learning rate correlation parameter.
class mxnet.optimizer.LBSGD(momentum=0.0, multi_precision=False, warmup_strategy='linear', warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60, **kwargs)
The Large Batch SGD optimizer with momentum and weight decay.
The optimizer updates the weight by:
    state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
    weight = weight - state
For details of the update algorithm see lbsgd_update and lbsgd_mom_update.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- momentum (float, optional) – The momentum value.
- multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False (default) uses the same precision as the weights. True makes an internal 32-bit copy of the weights and applies gradients in 32-bit precision even if the actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
- warmup_strategy (string) – One of 'linear', 'power2', 'sqrt', or 'lars'. Default: 'linear'.
- warmup_epochs (unsigned) – Default: 5.
- batch_scale (unsigned) – Default: 1 (same as batch size * numworkers).
- updates_per_epoch – Default: 32. The default might not reflect the true number of batches per epoch. Used for warmup.
- begin_epoch (unsigned) – Starting epoch. Default: 0.
class mxnet.optimizer.NAG(momentum=0.0, **kwargs)
Nesterov accelerated SGD.
This optimizer updates each weight by:
    state = momentum * state + grad + wd * weight
    weight = weight - (lr * (grad + momentum * state))
Parameters:
- momentum (float, optional) – The momentum value.
- multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False (default) uses the same precision as the weights. True makes an internal 32-bit copy of the weights and applies gradients in 32-bit precision even if the actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
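The NAG rule above differs from plain momentum in that the weight step uses both the raw gradient and the freshly updated state. Here is a scalar sketch of the documented formula, with rescaling and clipping left out; nag_step is a made-up helper name.

```python
def nag_step(weight, grad, state, lr=0.01, momentum=0.9, wd=0.0):
    """One Nesterov-accelerated step for a scalar weight, per the documented update."""
    state = momentum * state + grad + wd * weight
    weight = weight - lr * (grad + momentum * state)
    return weight, state


w, s = 1.0, 0.0
w, s = nag_step(w, 2.0, s)
print(w, s)
```

Here the first step is lr * (2.0 + 0.9 * 2.0), which is larger than a plain SGD step of lr * 2.0 because the look-ahead term already includes the momentum contribution.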
mxnet.optimizer.NDabs(data=None, out=None, name=None, **kwargs)
Returns element-wise absolute value of the input.
Example:
    abs([-2, 0, 3]) = [2, 0, 3]
The storage type of abs output depends upon the input storage type:
- abs(default) = default
- abs(row_sparse) = row_sparse
- abs(csr) = csr
Defined in src/operator/tensor/elemwise_unary_op_basic.cc:L662
Returns: out – The output of this function.
Return type: NDArray or list of NDArrays
class mxnet.optimizer.Nadam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, schedule_decay=0.004, **kwargs)
The Nesterov Adam optimizer.
Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum. The method is described at http://cs229.stanford.edu/proj2015/054_report.pdf.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- beta1 (float, optional) – Exponential decay rate for the first moment estimates.
- beta2 (float, optional) – Exponential decay rate for the second moment estimates.
- epsilon (float, optional) – Small value to avoid division by 0.
- schedule_decay (float, optional) – Exponential decay rate for the momentum schedule.
class mxnet.optimizer.Optimizer(rescale_grad=1.0, param_idx2name=None, wd=0.0, clip_gradient=None, learning_rate=0.01, lr_scheduler=None, sym=None, begin_num_update=0, multi_precision=False, param_dict=None)
The base class inherited by all optimizers.
Parameters:
- rescale_grad (float, optional) – Multiply the gradient by rescale_grad before updating. Often chosen to be 1.0/batch_size.
- param_idx2name (dict from int to string, optional) – A dictionary that maps int index to string name.
- clip_gradient (float, optional) – Clip the gradient by projecting onto the box [-clip_gradient, clip_gradient].
- learning_rate (float) – The initial learning rate.
- lr_scheduler (LRScheduler, optional) – The learning rate scheduler.
- wd (float, optional) – The weight decay (or L2 regularization) coefficient. Modifies the objective by adding a penalty for having large weights.
- sym (Symbol, optional) – The Symbol this optimizer is applying to.
- begin_num_update (int, optional) – The initial number of updates.
- multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False (default) uses the same precision as the weights. True makes an internal 32-bit copy of the weights and applies gradients in 32-bit precision even if the actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
Properties:
- learning_rate – The current learning rate of the optimizer. Given an Optimizer object optimizer, its learning rate can be accessed as optimizer.learning_rate.
static create_optimizer(name, **kwargs)
Instantiates an optimizer with a given name and kwargs.
Note: We can use the alias create for Optimizer.create_optimizer.
Parameters:
- name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
- kwargs (dict) – Parameters for the optimizer.
Returns: An instantiated optimizer.
Return type: Optimizer
Examples
>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)
>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)
create_state(index, weight)
Creates auxiliary state for a given weight.
Some optimizers require additional states, such as momentum, in addition to gradients in order to update weights. This function creates state for a given weight which will be used in update. This function is called only once for each weight.
Parameters:
- index (int) – A unique index to identify the weight.
- weight (NDArray) – The weight.
Returns: state – The state associated with the weight.
Return type: any obj
create_state_multi_precision(index, weight)
Creates auxiliary state for a given weight, including an FP32 high-precision copy if the original weight is FP16.
This method is provided to perform automatic mixed precision training for optimizers that do not support it themselves.
Parameters:
- index (int) – A unique index to identify the weight.
- weight (NDArray) – The weight.
Returns: state – The state associated with the weight.
Return type: any obj
static register(klass)
Registers a new optimizer.
Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.
Examples
>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))
set_learning_rate(lr)
Sets a new learning rate of the optimizer.
Parameters: lr (float) – The new learning rate of the optimizer.
set_lr_mult(args_lr_mult)
Sets an individual learning rate multiplier for each parameter.
If you specify a learning rate multiplier for a parameter, then the learning rate for the parameter will be set as the product of the global learning rate self.lr and its multiplier.
Note: The default learning rate multiplier of a Variable can be set with the lr_mult argument in the constructor.
Parameters: args_lr_mult (dict of str/int to float) – For each of its key-value entries, the learning rate multiplier for the parameter specified in the key will be set as the given value.
You can specify the parameter with either its name or its index. If you use the name, you should pass sym in the constructor, and the name you specified in the key of args_lr_mult should match the name of the parameter in sym. If you use the index, it should correspond to the index of the parameter used in the update method.
Specifying a parameter by its index is only supported for backward compatibility, and we recommend using the name instead.
set_wd_mult(args_wd_mult)
Sets an individual weight decay multiplier for each parameter.
By default, if param_idx2name was provided in the constructor, the weight decay multiplier is set to 0 for all parameters whose names don’t end with _weight or _gamma.
Note: The default weight decay multiplier for a Variable can be set with its wd_mult argument in the constructor.
Parameters: args_wd_mult (dict of string/int to float) – For each of its key-value entries, the weight decay multiplier for the parameter specified in the key will be set as the given value.
You can specify the parameter with either its name or its index. If you use the name, you should pass sym in the constructor, and the name you specified in the key of args_wd_mult should match the name of the parameter in sym. If you use the index, it should correspond to the index of the parameter used in the update method.
Specifying a parameter by its index is only supported for backward compatibility, and we recommend using the name instead.
update(index, weight, grad, state)
Updates the given parameter using the corresponding gradient and state.
Parameters:
- index (int) – The unique index of the parameter into the individual learning rates and weight decays. Learning rates and weight decay may be set via set_lr_mult() and set_wd_mult(), respectively.
- weight (NDArray) – The parameter to be updated.
- grad (NDArray) – The gradient of the objective with respect to this parameter.
- state (any obj) – The state returned by create_state().
update_multi_precision(index, weight, grad, state)
Updates the given parameter using the corresponding gradient and state. Mixed precision version.
Parameters:
- index (int) – The unique index of the parameter into the individual learning rates and weight decays. Learning rates and weight decay may be set via set_lr_mult() and set_wd_mult(), respectively.
- weight (NDArray) – The parameter to be updated.
- grad (NDArray) – The gradient of the objective with respect to this parameter.
- state (any obj) – The state returned by create_state().
- rescale_grad (float, optional) – Multiply the gradient with rescale_grad before updating. Often
choose to be
class mxnet.optimizer.RMSProp(learning_rate=0.001, gamma1=0.9, gamma2=0.9, epsilon=1e-08, centered=False, clip_weights=None, **kwargs)
The RMSProp optimizer.
Two versions of RMSProp are implemented:
If centered=False, we follow http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by Tieleman & Hinton, 2012. For details of the update algorithm see rmsprop_update.
If centered=True, we follow http://arxiv.org/pdf/1308.0850v5.pdf (38)-(45) by Alex Graves, 2013. For details of the update algorithm see rmspropalex_update.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- gamma1 (float, optional) – A decay factor of moving average over past squared gradient.
- gamma2 (float, optional) – A “momentum” factor. Only used if centered=True.
- epsilon (float, optional) – Small value to avoid division by 0.
- centered (bool, optional) – Flag to control which version of RMSProp to use. True uses Graves's version of RMSProp. False uses Tieleman & Hinton's version of RMSProp.
- clip_weights (float, optional) – Clips weights into range [-clip_weights, clip_weights].
class mxnet.optimizer.SGD(momentum=0.0, lazy_update=True, **kwargs)
The SGD optimizer with momentum and weight decay.
If the storage type of grad is row_sparse and lazy_update is True, lazy updates are applied by:
    for row in grad.indices:
        rescaled_grad[row] = lr * (rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row])
        state[row] = momentum * state[row] + rescaled_grad[row]
        weight[row] = weight[row] - state[row]
The sparse update only updates the momentum for the weights whose row_sparse gradient indices appear in the current batch, rather than updating it for all indices. Compared with the original update, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original update, and may lead to different empirical results.
Otherwise, standard updates are applied by:
    rescaled_grad = lr * (rescale_grad * clip(grad, clip_gradient) + wd * weight)
    state = momentum * state + rescaled_grad
    weight = weight - state
For details of the update algorithm see sgd_update and sgd_mom_update.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- momentum (float, optional) – The momentum value.
- lazy_update (bool, optional) – Default is True. If True, lazy updates are applied if the storage types of weight and grad are both row_sparse.
- multi_precision (bool, optional) – Flag to control the internal precision of the optimizer. False (default) uses the same precision as the weights. True makes an internal 32-bit copy of the weights and applies gradients in 32-bit precision even if the actual weights used in the model have lower precision. Turning this on can improve convergence and accuracy when training with float16.
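The standard SGD-with-momentum update above can be followed by hand on a scalar. The sketch below is a plain-Python rendering of the documented formula rather than the library's operator, and sgd_momentum_step is a made-up helper name.

```python
def sgd_momentum_step(weight, grad, state, lr=0.01, momentum=0.9, wd=0.0,
                      rescale_grad=1.0, clip_gradient=None):
    """One SGD-with-momentum step for a scalar weight, per the documented update."""
    if clip_gradient is not None:
        # Project the raw gradient onto [-clip_gradient, clip_gradient].
        grad = max(-clip_gradient, min(clip_gradient, grad))
    rescaled_grad = lr * (rescale_grad * grad + wd * weight)
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state


w, s = 1.0, 0.0
for _ in range(2):
    # A constant gradient of 1.0 makes the momentum build-up easy to follow.
    w, s = sgd_momentum_step(w, 1.0, s, lr=0.1, momentum=0.9)
print(w, s)
```

The learning rate is folded into the state, so the state holds the actual displacement applied to the weight: 0.1 on the first step and 0.9 * 0.1 + 0.1 on the second.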
class mxnet.optimizer.SGLD(**kwargs)
Stochastic Gradient Riemannian Langevin Dynamics.
This class implements the optimizer described in the paper Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex, available at https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.
class mxnet.optimizer.Signum(learning_rate=0.01, momentum=0.9, wd_lh=0.0, **kwargs)
The Signum optimizer that takes the sign of gradient or momentum.
The optimizer updates the weight by:
    rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
    state = momentum * state + (1-momentum)*rescaled_grad
    weight = (1 - lr * wd_lh) * weight - lr * sign(state)
References
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli & Anima Anandkumar. (2018). signSGD: Compressed Optimisation for Non-Convex Problems. In ICML'18.
See: https://arxiv.org/abs/1802.04434
For details of the update algorithm see signsgd_update and signum_update.
This optimizer accepts the following parameters in addition to those accepted by Optimizer.
Parameters:
- momentum (float, optional) – The momentum value.
- wd_lh (float, optional) – The amount of decoupled weight decay regularization; see details in the original paper at https://arxiv.org/abs/1711.05101.
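Because Signum steps along sign(state), the size of each update does not depend on the gradient's magnitude. The scalar sketch below follows the documented formula with clipping omitted; sign and signum_step are illustrative helper names.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signum_step(weight, grad, state, lr=0.01, momentum=0.9, wd=0.0,
                wd_lh=0.0, rescale_grad=1.0):
    """One Signum step for a scalar weight; gradient clipping is omitted here."""
    rescaled_grad = rescale_grad * grad + wd * weight
    state = momentum * state + (1 - momentum) * rescaled_grad
    weight = (1 - lr * wd_lh) * weight - lr * sign(state)
    return weight, state


w_small, _ = signum_step(1.0, 0.001, 0.0)
w_large, _ = signum_step(1.0, 1000.0, 0.0)
print(w_small, w_large)  # both weights moved by exactly lr
```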
class mxnet.optimizer.Updater(optimizer)
Updater for kvstore.
class mxnet.optimizer.ccSGD(*args, **kwargs)
[DEPRECATED] Same as SGD. Left here for backward compatibility.
mxnet.optimizer.create(name, **kwargs)
Instantiates an optimizer with a given name and kwargs.
Note: We can use the alias create for Optimizer.create_optimizer.
Parameters:
- name (str) – Name of the optimizer. Should be the name of a subclass of Optimizer. Case insensitive.
- kwargs (dict) – Parameters for the optimizer.
Returns: An instantiated optimizer.
Return type: Optimizer
Examples
>>> sgd = mx.optimizer.Optimizer.create_optimizer('sgd')
>>> type(sgd)
>>> adam = mx.optimizer.create('adam', learning_rate=.1)
>>> type(adam)
mxnet.optimizer.get_updater(optimizer)
Returns a closure of the updater needed for kvstore.
Parameters: optimizer (Optimizer) – The optimizer.
Returns: updater – The closure of the updater.
Return type: function
mxnet.optimizer.register(klass)
Registers a new optimizer.
Once an optimizer is registered, we can create an instance of this optimizer with create_optimizer later.
Examples
>>> @mx.optimizer.Optimizer.register
... class MyOptimizer(mx.optimizer.Optimizer):
...     pass
>>> optim = mx.optimizer.Optimizer.create_optimizer('MyOptimizer')
>>> print(type(optim))
Scheduling learning rate.
class mxnet.lr_scheduler.LRScheduler(base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')
Base class of a learning rate scheduler.
A scheduler returns a new learning rate based on the number of updates that have been performed.
Parameters:
- base_lr (float, optional) – The initial learning rate.
- warmup_steps (int) – Number of warmup steps used before this scheduler starts decay.
- warmup_begin_lr (float) – If using warmup, the learning rate from which it starts warming up.
- warmup_mode (string) – Warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments; ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps.
class mxnet.lr_scheduler.FactorScheduler(step, factor=1, stop_factor_lr=1e-08, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')
Reduce the learning rate by a factor for every n steps.
It returns a new learning rate by:
    base_lr * pow(factor, floor(num_update/step))
Parameters:
- step (int) – Changes the learning rate for every n updates.
- factor (float, optional) – The factor to change the learning rate.
- stop_factor_lr (float, optional) – Stop updating the learning rate if it is less than this value.
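The documented FactorScheduler formula can be evaluated directly. The function below is a stand-alone sketch, not the library class. It assumes that stop_factor_lr acts as a floor on the returned rate and that warmup is disabled. The library applies the decay incrementally, so its behavior exactly at a step boundary may differ by one update.

```python
def factor_lr(num_update, step, factor=1.0, base_lr=0.01, stop_factor_lr=1e-8):
    """Learning rate from the documented FactorScheduler formula (no warmup)."""
    lr = base_lr * factor ** (num_update // step)
    # Assumption: once the rate would fall below stop_factor_lr, hold it there.
    return max(lr, stop_factor_lr)


for n in (0, 99, 100, 250):
    print(n, factor_lr(n, step=100, factor=0.9, base_lr=0.1))
```

With step=100 and factor=0.9, as in the overview example, the rate after 250 updates has been multiplied by 0.9 twice.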
class mxnet.lr_scheduler.MultiFactorScheduler(step, factor=1, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')
Reduce the learning rate at each step in a given list of steps.
Assume there exists k such that:
    step[k] <= num_update and num_update < step[k+1]
Then calculate the new learning rate by:
    base_lr * pow(factor, k+1)
Parameters:
- step (list of int) – The list of steps at which to schedule a change.
- factor (float) – The factor to change the learning rate.
- warmup_steps (int) – Number of warmup steps used before this scheduler starts decay.
- warmup_begin_lr (float) – If using warmup, the learning rate from which it starts warming up.
- warmup_mode (string) – Warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments; ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps.
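The index k in the MultiFactorScheduler formula can be found with a binary search over the sorted step list. The function below is an illustrative sketch of the documented formula with warmup disabled, not the library class; multifactor_lr is a made-up name.

```python
from bisect import bisect_right


def multifactor_lr(num_update, steps, factor=1.0, base_lr=0.01):
    """Learning rate from the documented MultiFactorScheduler formula (no warmup)."""
    # Number of boundaries with step[i] <= num_update. This equals k + 1 in
    # the formula above, and 0 before the first boundary is reached.
    reached = bisect_right(steps, num_update)
    return base_lr * factor ** reached


for n in (50, 100, 250, 1000):
    print(n, multifactor_lr(n, steps=[100, 200, 400], factor=0.5, base_lr=0.1))
```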
class mxnet.lr_scheduler.PolyScheduler(max_update, base_lr=0.01, pwr=2, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')
Reduce the learning rate according to a polynomial of given power.
Calculate the new learning rate, after warmup if any, by:
    final_lr + (start_lr - final_lr) * (1-nup/max_nup)^pwr if nup < max_nup, 0 otherwise.
Parameters:
- max_update (int) – Maximum number of updates before the decay reaches the final learning rate.
- base_lr (float) – Base learning rate to start from.
- pwr (int) – Power of the decay term as a function of the current number of updates.
- final_lr (float) – Final learning rate after all steps.
- warmup_steps (int) – Number of warmup steps used before this scheduler starts decay.
- warmup_begin_lr (float) – If using warmup, the learning rate from which it starts warming up.
- warmup_mode (string) – Warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments; ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps.
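A stand-alone sketch of the polynomial decay formula follows, with warmup disabled and start_lr taken to be base_lr. One assumption is flagged in the code: after max_update the sketch returns final_lr, which coincides with the documented "0 otherwise" when final_lr is left at its default of 0. The name poly_lr is made up.

```python
def poly_lr(num_update, max_update, base_lr=0.01, pwr=2, final_lr=0.0):
    """Learning rate from the documented PolyScheduler formula (no warmup)."""
    if num_update >= max_update:
        # Assumption: hold final_lr once the decay has finished.
        return final_lr
    remaining = 1.0 - num_update / max_update
    return final_lr + (base_lr - final_lr) * remaining ** pwr


for n in (0, 50, 100):
    print(n, poly_lr(n, max_update=100, base_lr=0.1, pwr=2))
```

Halfway through with pwr=2, the rate has dropped to a quarter of base_lr.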
class mxnet.lr_scheduler.CosineScheduler(max_update, base_lr=0.01, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')
Reduce the learning rate according to a cosine function.
Calculate the new learning rate by:
    final_lr + (start_lr - final_lr) * (1+cos(pi * nup/max_nup))/2 if nup < max_nup, 0 otherwise.
Parameters:
- max_update (int) – Maximum number of updates before the decay reaches 0.
- base_lr (float) – Base learning rate.
- final_lr (float) – Final learning rate after all steps.
- warmup_steps (int) – Number of warmup steps used before this scheduler starts decay.
- warmup_begin_lr (float) – If using warmup, the learning rate from which it starts warming up.
- warmup_mode (string) – Warmup can be done in two modes. ‘linear’ mode gradually increases lr with each step in equal increments; ‘constant’ mode keeps lr at warmup_begin_lr for warmup_steps.
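The cosine formula and the 'linear' warmup mode can be combined in one small function. This is a sketch under stated assumptions, not the library class: warmup is taken to rise in equal increments from warmup_begin_lr to base_lr, the decay is measured from the end of warmup, and final_lr is returned once max_update is reached. The name cosine_lr is made up.

```python
import math


def cosine_lr(num_update, max_update, base_lr=0.01, final_lr=0.0,
              warmup_steps=0, warmup_begin_lr=0.0):
    """Cosine decay with optional linear warmup, following the documented formula."""
    if num_update < warmup_steps:
        # Assumed 'linear' mode: equal increments from warmup_begin_lr to base_lr.
        return warmup_begin_lr + (base_lr - warmup_begin_lr) * num_update / warmup_steps
    if num_update >= max_update:
        return final_lr
    # Assumption: nup and max_nup are counted from the end of warmup.
    nup = num_update - warmup_steps
    max_nup = max_update - warmup_steps
    return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * nup / max_nup)) / 2


for n in (0, 5, 10, 60, 110):
    print(n, cosine_lr(n, max_update=110, base_lr=0.1, warmup_steps=10))
```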
Weight initializer.
class mxnet.initializer.InitDesc
Descriptor for the initialization pattern.
Parameters:
- name (str) – Name of variable.
- attrs (dict of str to str) – Attributes of this variable taken from Symbol.attr_dict.
- global_init (Initializer) – Global initializer to fall back to.
-
class
mxnet.initializer.
Initializer
(**kwargs)[source]¶ The base class of an initializer.
-
set_verbosity
(verbose=False, print_func=None)[source]¶ Switch on/off verbose mode
Parameters: - verbose (bool) – switch on/off verbose mode
- print_func (function) – A function that computes statistics of initialized arrays. Takes an NDArray and returns an str. Defaults to mean absolute value str((abs(x)/size(x)).asscalar()).
dumps()
Saves the initializer to a string.
Returns: JSON-formatted string that describes the initializer.
Return type: str
Examples
>>> # Create initializer and retrieve its parameters
...
>>> init = mx.init.Normal(0.5)
>>> init.dumps()
'["normal", {"sigma": 0.5}]'
>>> init = mx.init.Xavier(factor_type="in", magnitude=2.34)
>>> init.dumps()
'["xavier", {"rnd_type": "uniform", "magnitude": 2.34, "factor_type": "in"}]'
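The output format shown above, a JSON list holding the lowercased class name and the constructor arguments, can be reproduced with a small stand-in class. The names `SketchInitializer` and `_kwargs` are hypothetical and only illustrate the serialization shape; they are not the library's classes.

```python
import json


class SketchInitializer:
    """Stand-in base class that remembers its constructor keyword arguments."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs

    def dumps(self):
        # Serialize as [lowercased class name, kwargs], mirroring the examples above.
        return json.dumps([self.__class__.__name__.lower(), self._kwargs])


class Normal(SketchInitializer):
    def __init__(self, sigma=0.01):
        super().__init__(sigma=sigma)


serialized = Normal(0.5).dumps()
# The string round-trips back into a name and an argument dictionary.
name, kwargs = json.loads(serialized)
```

Storing the name together with the arguments is what makes it possible to rebuild an equivalent initializer from the string later.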
mxnet.initializer.register(klass)
Registers a custom initializer.
Custom initializers can be created by extending mx.init.Initializer and implementing the required functions like _init_weight and _init_bias. The created initializer must be registered using mx.init.register before it can be called by name.
Parameters: klass (class) – A subclass of mx.init.Initializer that needs to be registered as a custom initializer.
Example
>>> # Create and register a custom initializer that
... # initializes weights to 0.1 and biases to 1.
...
>>> @mx.init.register
... @mx.init.alias('myinit')
... class CustomInit(mx.init.Initializer):
...     def __init__(self):
...         super(CustomInit, self).__init__()
...     def _init_weight(self, _, arr):
...         arr[:] = 0.1
...     def _init_bias(self, _, arr):
...         arr[:] = 1
...
>>> # Module is an instance of 'mxnet.module.Module'
...
>>> module.init_params("custominit")
>>> # module.init_params("myinit")
>>> # module.init_params(CustomInit())
class mxnet.initializer.Load(param, default_init=None, verbose=False)
Initializes variables by loading data from file or dict.
Note: Load will drop the arg: or aux: prefix from each name and initialize the variables that match with the prefix dropped.
Parameters: - param (str or dict of str->`NDArray`) – Parameter file or dict mapping name to NDArray.
- default_init (Initializer) – Default initializer when name is not found in param.
- verbose (bool) – Flag for enabling logging of source when initializing.
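The prefix handling described in the note can be sketched with plain dictionaries. The helper names `strip_prefix` and `lookup`, and the sample parameter names, are hypothetical; lists stand in for NDArray values.

```python
def strip_prefix(name):
    """Drop a leading 'arg:' or 'aux:' from a saved parameter name."""
    for prefix in ("arg:", "aux:"):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


# Saved parameters as they might appear in a checkpoint (values are placeholders).
saved = {"arg:fc_weight": [0.1, 0.2], "aux:bn_moving_mean": [0.0], "fc_bias": [1.0]}
params = {strip_prefix(key): value for key, value in saved.items()}


def lookup(name, default=None):
    """Return the loaded value, or a default when the name is missing."""
    return params.get(name, default)


found = lookup("fc_weight")
missing = lookup("conv_weight", default="use default_init")
```

The fallback in `lookup` plays the role that `default_init` has in the parameter list above: it is only consulted for names that are absent from the loaded data.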
class mxnet.initializer.Mixed(patterns, initializers)
Initializes parameters using multiple initializers.
Parameters: - patterns (list of str) – List of regular expressions matching parameter names.
- initializers (list of Initializer) – List of initializers corresponding to patterns.
Example
>>> # Given 'module', an instance of 'mxnet.module.Module', initialize biases to zero
... # and every other parameter to random values with uniform distribution.
...
>>> init = mx.initializer.Mixed(['bias', '.*'], [mx.init.Zero(), mx.init.Uniform(0.1)])
>>> module.init_params(init)
>>>
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected1_weight
[[ 0.0097627 0.01856892 0.04303787]]
fullyconnected1_bias
[ 0.]
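How a list of patterns maps parameter names to initializers can be sketched with the standard `re` module. This is an illustration under two assumptions: the first pattern that matches wins, and patterns are matched from the start of the name with `re.match`. The helper `pick_initializer` is hypothetical, and strings stand in for initializer objects.

```python
import re


def pick_initializer(name, patterns, initializers):
    """Return the initializer paired with the first pattern that matches name."""
    for pattern, initializer in zip(patterns, initializers):
        # Assumption: anchored matching at the start of the name.
        if re.match(pattern, name):
            return initializer
    raise ValueError("no pattern matches parameter %r" % name)


# '.*bias' is written so that it matches under anchored matching too.
patterns = [".*bias", ".*"]
initializers = ["zero", "uniform"]

chosen = {
    name: pick_initializer(name, patterns, initializers)
    for name in ("fullyconnected1_weight", "fullyconnected1_bias")
}
```

Under these assumptions the order of the patterns matters: a catch-all such as `.*` should come last, otherwise it shadows every pattern after it.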
class mxnet.initializer.Zero
Initializes weights to zero.
Example
>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights to zero.
...
>>> init = mx.initializer.Zero()
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 0. 0. 0.]]
class mxnet.initializer.One
Initializes weights to one.
Example
>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights to one.
...
>>> init = mx.initializer.One()
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 1. 1. 1.]]
class mxnet.initializer.Constant(value)
Initializes the weights to a given value. The value passed in can be a scalar or an NDArray that matches the shape of the parameter to be set.
Parameters: value (float, NDArray) – Value to set.
class mxnet.initializer.Uniform(scale=0.07)
Initializes weights with random values uniformly sampled from a given range.
Parameters: scale (float, optional) – The bound on the range of the generated random values. Values are generated from the range [-scale, scale]. Default scale is 0.07.
Example
>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights
>>> # to random values uniformly sampled between -0.1 and 0.1.
...
>>> init = mx.init.Uniform(0.1)
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[ 0.01360891 -0.02144304 0.08511933]]
class mxnet.initializer.Normal(sigma=0.01)
Initializes weights with random values sampled from a normal distribution with a mean of zero and standard deviation of sigma.
Parameters: sigma (float, optional) – Standard deviation of the normal distribution. Default standard deviation is 0.01.
Example
>>> # Given 'module', an instance of 'mxnet.module.Module', initialize weights
>>> # to random values sampled from a normal distribution.
...
>>> init = mx.init.Normal(0.5)
>>> module.init_params(init)
>>> for dictionary in module.get_params():
...     for key in dictionary:
...         print(key)
...         print(dictionary[key].asnumpy())
...
fullyconnected0_weight
[[-0.3214761 -0.12660924 0.53789419]]
class mxnet.initializer.Orthogonal(scale=1.414, rand_type='uniform')
Initializes the weight as an orthogonal matrix.
This initializer implements "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", available at https://arxiv.org/abs/1312.6120.
Parameters: - scale (float, optional) – Scaling factor of weight.
- rand_type (string, optional) – Use “uniform” or “normal” random numbers to initialize the weight.
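What "orthogonal" means for a weight matrix can be shown with a small standard-library sketch. It uses Gram-Schmidt orthogonalization of uniformly sampled rows as a stand-in; this is not necessarily how the library builds the matrix, and the function name `orthogonal_rows` and the `seed` argument are hypothetical.

```python
import random


def orthogonal_rows(n_rows, n_cols, scale=1.414, seed=0):
    """Return n_rows mutually orthogonal rows of length n_cols, each of norm scale."""
    if n_rows > n_cols:
        raise ValueError("this sketch needs n_rows <= n_cols")
    rng = random.Random(seed)
    basis = []
    while len(basis) < n_rows:
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
        # Remove the components along the rows already accepted.
        for row in basis:
            dot = sum(a * b for a, b in zip(candidate, row))
            candidate = [a - dot * b for a, b in zip(candidate, row)]
        norm = sum(a * a for a in candidate) ** 0.5
        if norm > 1e-8:  # skip degenerate samples
            basis.append([a / norm for a in candidate])
    return [[scale * a for a in row] for row in basis]


weights = orthogonal_rows(3, 4)
```

Multiplying `weights` by its transpose gives `scale ** 2` on the diagonal and values close to zero elsewhere, which is the property the initializer is named after.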
class mxnet.initializer.Xavier(rnd_type='uniform', factor_type='avg', magnitude=3)
Returns an initializer performing “Xavier” initialization for weights.
This initializer is designed to keep the scale of gradients roughly the same in all layers.
By default, rnd_type is 'uniform' and factor_type is 'avg', and the initializer fills the weights with random numbers in the range of \([-c, c]\), where \(c = \sqrt{\frac{3.}{0.5 * (n_{in} + n_{out})}}\). \(n_{in}\) is the number of neurons feeding into weights, and \(n_{out}\) is the number of neurons the result is fed to.
If rnd_type is 'uniform' and factor_type is 'in', then \(c = \sqrt{\frac{3.}{n_{in}}}\). Similarly, when factor_type is 'out', \(c = \sqrt{\frac{3.}{n_{out}}}\).
If rnd_type is 'gaussian' and factor_type is 'avg', the initializer fills the weights with numbers from a normal distribution with a standard deviation of \(\sqrt{\frac{3.}{0.5 * (n_{in} + n_{out})}}\).
Parameters: - rnd_type (str, optional) – Random generator type, can be 'gaussian' or 'uniform'.
- factor_type (str, optional) – Can be 'avg', 'in', or 'out'.
- magnitude (float, optional) – Scale of random number.
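The three cases above differ only in the denominator, which a short sketch makes explicit. The function name `xavier_scale` is hypothetical, and it is assumed here that the constant 3 in the formulas is the default value of `magnitude`.

```python
import math


def xavier_scale(n_in, n_out, factor_type="avg", magnitude=3.0):
    """Bound c for uniform sampling (or the standard deviation for gaussian)."""
    factors = {
        "avg": 0.5 * (n_in + n_out),
        "in": n_in,
        "out": n_out,
    }
    if factor_type not in factors:
        raise ValueError("factor_type must be 'avg', 'in', or 'out'")
    # Assumption: magnitude replaces the constant 3 from the formulas above.
    return math.sqrt(magnitude / factors[factor_type])


# A layer with 20 inputs and 10 outputs, evaluated for each factor type.
scales = {ft: xavier_scale(20, 10, factor_type=ft) for ft in ("avg", "in", "out")}
```

For this layer the 'out' setting gives the widest range and 'in' the narrowest, because the smaller fan produces the larger scale.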
class mxnet.initializer.MSRAPrelu(factor_type='avg', slope=0.25)
Initializes the weight according to an MSRA paper.
This initializer implements Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, available at https://arxiv.org/abs/1502.01852.
This initializer is proposed for initialization related to ReLU activation; it makes some changes on top of the Xavier method.
Parameters: - factor_type (str, optional) – Can be 'avg', 'in', or 'out'.
- slope (float, optional) – initial slope of any PReLU (or similar) nonlinearities.
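The effect of `slope` can be illustrated with the standard deviation proposed in the cited paper, \(\sqrt{2 / ((1 + a^2) n)}\), where \(a\) is the slope and \(n\) is the fan. The sketch below assumes the initializer follows that formula; the function name `msra_prelu_std` is hypothetical, and `fan` stands for whichever of n_in, n_out, or their average factor_type selects.

```python
import math


def msra_prelu_std(fan, slope=0.25):
    """Standard deviation from the paper's formula sqrt(2 / ((1 + slope^2) * fan))."""
    magnitude = 2.0 / (1 + slope ** 2)
    return math.sqrt(magnitude / fan)


# With slope 0 the formula reduces to the plain ReLU case sqrt(2 / fan).
relu_std = msra_prelu_std(fan=50, slope=0.0)
# A positive slope lets some negative signal through, so the weights can be smaller.
prelu_std = msra_prelu_std(fan=50)
```

Compared with the Xavier formulas, the numerator changes from 3 to `2 / (1 + slope ** 2)` while the denominator is chosen the same way.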
class mxnet.initializer.LSTMBias(forget_bias=1.0)
Initializes all biases of an LSTMCell to 0.0 except for the forget gate, whose bias is set to a custom value.
Parameters: forget_bias (float, default 1.0) – bias for the forget gate. Jozefowicz et al. 2015 recommend setting this to 1.0.
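The resulting bias layout can be sketched with a plain list. This assumes the four gate biases are packed into one vector of length 4 * num_hidden in the order input, forget, cell, output; the packing order and the function name `lstm_bias` are assumptions made for illustration.

```python
def lstm_bias(num_hidden, forget_bias=1.0):
    """Build a packed LSTM bias vector with only the forget-gate slice non-zero."""
    bias = [0.0] * (4 * num_hidden)
    # Assumption: gates are packed as [input, forget, cell, output],
    # so the forget gate occupies the second quarter of the vector.
    bias[num_hidden:2 * num_hidden] = [forget_bias] * num_hidden
    return bias


bias = lstm_bias(num_hidden=2)
```

A positive forget-gate bias keeps the gate mostly open at the start of training, so the cell state is carried forward by default.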
class mxnet.initializer.FusedRNN(init, num_hidden, num_layers, mode, bidirectional=False, forget_bias=1.0)
Initializes parameters for fused RNN layers.
Parameters: - init (Initializer) – initializer applied to unpacked weights. Fall back to global initializer if None.
- num_hidden (int) – should match the value passed to FusedRNNCell.
- num_layers (int) – should match the value passed to FusedRNNCell.
- mode (str) – should match the value passed to FusedRNNCell.
- bidirectional (bool) – should match the value passed to FusedRNNCell.
- forget_bias (float) – should match the value passed to FusedRNNCell.