
Gluon provides pre-defined loss functions in the mxnet.gluon.loss module.

Losses for training neural networks.


Loss(weight, batch_axis, **kwargs)

Base class for loss.

L2Loss([weight, batch_axis])

Calculates the mean squared error between label and pred.

L1Loss([weight, batch_axis])

Calculates the mean absolute error between label and pred.


SigmoidBinaryCrossEntropyLoss([from_sigmoid, …])

The cross-entropy loss for binary classification.


SigmoidBCELoss

The cross-entropy loss for binary classification.

SoftmaxCrossEntropyLoss([axis, …])

Computes the softmax cross entropy loss.


SoftmaxCELoss

Computes the softmax cross entropy loss.

KLDivLoss([from_logits, axis, weight, …])

The Kullback-Leibler divergence loss.

CTCLoss([layout, label_layout, weight])

Connectionist Temporal Classification Loss.

HuberLoss([rho, weight, batch_axis])

Calculates smoothed L1 loss that is equal to L1 loss if absolute error exceeds rho but is equal to L2 loss otherwise.

HingeLoss([margin, weight, batch_axis])

Calculates the hinge loss function often used in SVMs:

SquaredHingeLoss([margin, weight, batch_axis])

Calculates the soft-margin loss function used in SVMs:

LogisticLoss([weight, batch_axis, label_format])

Calculates the logistic loss (for binary losses only):

TripletLoss([margin, weight, batch_axis])

Calculates triplet loss given three input tensors and a positive margin.

PoissonNLLLoss([weight, from_logits, …])

Calculates the negative log-likelihood loss for a target (random variable) that follows a Poisson distribution.

CosineEmbeddingLoss([weight, batch_axis, margin])

For a target label 1 or -1, vectors input1 and input2, the function computes the cosine distance between the vectors.

SDMLLoss([smoothing_parameter, weight, …])

Calculates Batchwise Smoothed Deep Metric Learning (SDML) Loss given two input tensors and a smoothing weight. SDML Loss learns similarity between paired samples by using unpaired samples in the minibatch as potential negative examples.

class Loss(weight, batch_axis, **kwargs)[source]

Bases: mxnet.gluon.block.HybridBlock

Base class for loss.

  • weight (float or None) – Global scalar weight for loss.

class L2Loss(weight=1.0, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the mean squared error between label and pred.
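As a rough illustration of this definition, the pure-Python sketch below computes a per-sample L2 loss without MXNet. The helper name l2_loss is hypothetical, and two details are assumptions rather than statements from this page: the factor of 1/2 on the squared error, and averaging over the non-batch elements of each sample.

```python
def l2_loss(pred, label):
    """Return one loss value per sample (row).

    Assumed formula: mean over a row's elements of 0.5 * (label - pred)**2.
    """
    losses = []
    for pred_row, label_row in zip(pred, label):
        squared = [0.5 * (l - p) ** 2 for p, l in zip(pred_row, label_row)]
        losses.append(sum(squared) / len(squared))
    return losses


# Two samples with two values each.
pred = [[1.0, 2.0], [0.0, 0.0]]
label = [[1.0, 4.0], [1.0, 1.0]]
batch_loss = l2_loss(pred, label)  # [1.0, 0.5]
```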




class L1Loss(weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the mean absolute error between label and pred.
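The following pure-Python sketch illustrates the same idea for the absolute error. The helper l1_loss is hypothetical; the treatment of weight as a plain multiplier and the averaging over each sample's elements are assumptions made for the example.

```python
def l1_loss(pred, label, weight=None):
    """Return one loss value per sample: mean of |label - pred| over the row.

    `weight` is treated as a global scalar multiplier (an assumption).
    """
    losses = []
    for pred_row, label_row in zip(pred, label):
        abs_err = [abs(l - p) for p, l in zip(pred_row, label_row)]
        loss = sum(abs_err) / len(abs_err)
        if weight is not None:
            loss *= weight
        losses.append(loss)
    return losses


pred = [[1.0, 2.0], [0.0, 0.0]]
label = [[1.0, 4.0], [1.0, 1.0]]
plain = l1_loss(pred, label)               # [1.0, 1.0]
scaled = l1_loss(pred, label, weight=0.5)  # [0.5, 0.5]
```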




class SigmoidBinaryCrossEntropyLoss(from_sigmoid=False, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

The cross-entropy loss for binary classification. (alias: SigmoidBCELoss)

BCE loss is useful when training logistic regression. If from_sigmoid is False (default), this loss computes:




prob = \frac{1}{1 + \exp(-{pred})}

L = - \sum_i {label}_i * \log({prob}_i) + (1 - {label}_i) * \log(1 - {prob}_i)
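Binary cross-entropy on raw scores (from_sigmoid=False) can be sketched in plain Python as below. Both helper names are hypothetical. The second function uses a standard algebraic rearrangement that avoids overflow for large scores; whether Gluon uses exactly this form is an assumption.

```python
import math


def bce_naive(pred, label):
    """Apply the sigmoid, then take the cross-entropy (label is 0 or 1)."""
    prob = 1.0 / (1.0 + math.exp(-pred))
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def bce_stable(pred, label):
    """Equivalent form: max(pred, 0) - pred * label + log(1 + exp(-|pred|))."""
    return max(pred, 0.0) - pred * label + math.log1p(math.exp(-abs(pred)))


# Both forms agree for moderate scores.
moderate = (bce_naive(0.5, 1.0), bce_stable(0.5, 1.0))

# The stable form still works where the naive one would hit log(0).
extreme = bce_stable(1000.0, 0.0)  # 1000.0
```

The naive version is shown only for comparison; it fails for scores of large magnitude because the sigmoid saturates to exactly 0 or 1 in floating point.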



SigmoidBCELoss

alias of mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss

class SoftmaxCrossEntropyLoss(axis=-1, sparse_label=True, from_logits=False, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Computes the softmax cross entropy loss. (alias: SoftmaxCELoss)

If sparse_label is True (default), label should contain integer category indicators:




p = \text{softmax}({pred})

L = -\sum_i \log p_{i,{label}_i}
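A minimal pure-Python sketch of softmax cross entropy for a single sample follows. The helper names are hypothetical. The dense variant, where the label is a full probability vector, is included as an assumption about what sparse_label=False means; with a one-hot label it gives the same value as the sparse variant.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def softmax_ce_sparse(logits, label):
    """label is an integer class index."""
    return -log_softmax(logits)[label]


def softmax_ce_dense(logits, label_dist):
    """label_dist is a probability vector with the same length as logits."""
    return -sum(q * lp for q, lp in zip(label_dist, log_softmax(logits)))


logits = [1.0, 2.0, 3.0]
sparse = softmax_ce_sparse(logits, 2)
dense = softmax_ce_dense(logits, [0.0, 0.0, 1.0])  # same value as `sparse`
```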



SoftmaxCELoss

alias of mxnet.gluon.loss.SoftmaxCrossEntropyLoss

class KLDivLoss(from_logits=True, axis=-1, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

The Kullback-Leibler divergence loss.

KL divergence measures how one probability distribution differs from another. It can be used to minimize information loss when approximating a distribution. If from_logits is True (default), loss is defined as:

L = \sum_i {label}_i * \big[\log({label}_i) - {pred}_i\big]
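For the from_logits=True case, where pred already holds log-probabilities, the computation can be sketched in plain Python as below. The helper name kl_div and the small eps guard against log(0) are hypothetical, and summing (rather than averaging) over the elements is an assumption.

```python
import math


def kl_div(log_pred, label, eps=1e-12):
    """Sum over i of label_i * (log(label_i) - log_pred_i).

    log_pred: log-probabilities predicted by the model.
    label:    target probabilities.
    eps:      hypothetical guard so that label_i == 0 does not hit log(0).
    """
    return sum(q * (math.log(q + eps) - lp) for q, lp in zip(label, log_pred))


target = [0.25, 0.75]
same = kl_div([math.log(0.25), math.log(0.75)], target)    # close to 0
different = kl_div([math.log(0.9), math.log(0.1)], target)  # positive
```

The loss is close to zero when the predicted distribution matches the target and grows as the two diverge.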




class CTCLoss(layout='NTC', label_layout='NT', weight=None, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Connectionist Temporal Classification Loss.

  • layout (str, default 'NTC') – Layout of prediction tensor. ‘N’, ‘T’, ‘C’ stand for batch size, sequence length, and alphabet_size respectively.

  • label_layout (str, default 'NT') – Layout of the labels. ‘N’, ‘T’ stand for batch size and sequence length respectively.

  • weight (float or None) – Global scalar weight for loss.



class HuberLoss(rho=1, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates smoothed L1 loss that is equal to L1 loss if absolute error exceeds rho but is equal to L2 loss otherwise. Also called SmoothedL1 loss.

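L = \sum_i \begin{cases} \frac{1}{2 {rho}} ({label}_i - {pred}_i)^2 & \text{ if } |{label}_i - {pred}_i| < {rho} \\ |{label}_i - {pred}_i| - \frac{{rho}}{2} & \text{ otherwise } \end{cases}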



label and pred can have arbitrary shape as long as they have the same number of elements.

  • rho (float, default 1) – Threshold for trimmed mean estimator.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • pred: prediction tensor with arbitrary shape

  • label: target tensor with the same size as pred.

  • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.


forward(pred, label, sample_weight=None)[source]
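The piecewise definition of the Huber loss can be sketched for a single element in plain Python. The helper name huber is hypothetical; the example also shows that the two branches meet at the threshold rho.

```python
def huber(pred, label, rho=1.0):
    """Quadratic below the threshold rho, linear at or above it."""
    err = abs(label - pred)
    if err < rho:
        return 0.5 / rho * err ** 2
    return err - 0.5 * rho


small = huber(0.5, 0.0)        # 0.125, quadratic branch
large = huber(3.0, 0.0)        # 2.5, linear branch
at_threshold = huber(1.0, 0.0)  # 0.5, both branches agree here
```

Because the penalty grows only linearly for large errors, outliers pull on the fit less than they would under a pure squared error.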

class HingeLoss(margin=1, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the hinge loss function often used in SVMs:




L = \sum_i \max(0, {margin} - {pred}_i \cdot {label}_i)

where pred is the classifier prediction and label is the target tensor containing values -1 or 1. label and pred must have the same number of elements.

  • margin (float) – The margin in hinge loss. Defaults to 1.0.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • pred: prediction tensor with arbitrary shape.

  • label: truth tensor with values -1 or 1. Must have the same size as pred.

  • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.


forward(pred, label, sample_weight=None)[source]
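A per-element sketch of the hinge loss in plain Python is given below, assuming the standard form max(0, margin - pred * label) with labels of -1 or 1. The helper name hinge is hypothetical.

```python
def hinge(pred, label, margin=1.0):
    """Zero once pred is on the correct side of the margin, linear otherwise."""
    return max(0.0, margin - pred * label)


confident = hinge(2.0, 1)   # 0.0, correct and outside the margin
inside = hinge(0.5, 1)      # 0.5, correct but inside the margin
wrong = hinge(0.5, -1)      # 1.5, wrong side
```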

class SquaredHingeLoss(margin=1, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the soft-margin loss function used in SVMs:




L = \sum_i \max(0, {margin} - {pred}_i \cdot {label}_i)^2

where pred is the classifier prediction and label is the target tensor containing values -1 or 1. label and pred can have arbitrary shape as long as they have the same number of elements.

  • margin (float) – The margin in hinge loss. Defaults to 1.0.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • pred: prediction tensor with arbitrary shape

  • label: truth tensor with values -1 or 1. Must have the same size as pred.

  • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.


forward(pred, label, sample_weight=None)[source]
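The squared variant can be sketched the same way, assuming the form max(0, margin - pred * label)^2. The helper name squared_hinge is hypothetical.

```python
def squared_hinge(pred, label, margin=1.0):
    """Square of the hinge term, so violations are penalized quadratically."""
    return max(0.0, margin - pred * label) ** 2


inside = squared_hinge(0.5, 1)   # 0.25
wrong = squared_hinge(0.5, -1)   # 2.25
```

Compared with the plain hinge, small margin violations cost less and large ones cost more.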

class LogisticLoss(weight=None, batch_axis=0, label_format='signed', **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the logistic loss (for binary losses only):




L = \sum_i \log(1 + \exp(- {pred}_i \cdot {label}_i))

where pred is the classifier prediction and label is the target tensor containing values -1 or 1 (0 or 1 if label_format is binary). label and pred can have arbitrary shape as long as they have the same number of elements.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • label_format (str, default 'signed') – Can be either ‘signed’ or ‘binary’. If the label_format is ‘signed’, all label values should be either -1 or 1. If the label_format is ‘binary’, all label values should be either 0 or 1.

  • Inputs

    • pred: prediction tensor with arbitrary shape.

    • label: truth tensor with values -1/1 (label_format is ‘signed’) or 0/1 (label_format is ‘binary’). Must have the same size as pred.

    • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • Outputs

    • loss: loss tensor with shape (batch_size,). Dimensions other than batch_axis are averaged out.


forward(pred, label, sample_weight=None)[source]
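The two label formats can be illustrated with a per-element sketch in plain Python. The helper name logistic_loss is hypothetical, and mapping signed labels to 0/1 before using a numerically stable expression is an assumption about one reasonable way to compute log(1 + exp(-pred * label)).

```python
import math


def logistic_loss(pred, label, label_format="signed"):
    """Logistic loss for one element.

    'signed' expects label in {-1, 1}; 'binary' expects label in {0, 1}.
    """
    if label_format not in ("signed", "binary"):
        raise ValueError("label_format must be 'signed' or 'binary'")
    if label_format == "signed":
        label = (label + 1.0) / 2.0  # map -1/1 to 0/1
    # Stable form of the binary cross-entropy on a raw score.
    return max(pred, 0.0) - pred * label + math.log1p(math.exp(-abs(pred)))


signed = logistic_loss(0.7, -1, label_format="signed")
binary = logistic_loss(0.7, 0, label_format="binary")  # same value as `signed`
```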

class TripletLoss(margin=1, weight=None, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates triplet loss given three input tensors and a positive margin. Triplet loss measures the relative similarity between a positive example, a negative example, and prediction:




L = \sum_i \max(\Vert {pred}_i - {positive}_i \Vert_2^2 - \Vert {pred}_i - {negative}_i \Vert_2^2 + {margin}, 0)

positive, negative, and pred can have arbitrary shape as long as they have the same number of elements.

  • margin (float) – Margin of separation between correct and incorrect pair.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • pred: prediction tensor with arbitrary shape

  • positive: positive example tensor with arbitrary shape. Must have the same size as pred.

  • negative: negative example tensor with arbitrary shape. Must have the same size as pred.

  • loss: loss tensor with shape (batch_size,).


forward(pred, positive, negative, sample_weight=None)[source]
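A single-sample sketch of the triplet loss in plain Python follows. It assumes squared Euclidean distances and the form max(d(pred, positive) - d(pred, negative) + margin, 0); the helper name triplet is hypothetical.

```python
def triplet(pred, positive, negative, margin=1.0):
    """Penalize the sample unless pred is closer to positive than to
    negative by at least `margin` (in squared Euclidean distance)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(pred, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(pred, negative))
    return max(d_pos - d_neg + margin, 0.0)


# Negative is far away: the margin is satisfied and the loss is zero.
satisfied = triplet([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])  # 0.0

# Negative is closer than positive: the loss is positive.
violated = triplet([0.0, 0.0], [1.0, 1.0], [1.0, 0.0])   # 2.0
```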

class PoissonNLLLoss(weight=None, from_logits=True, batch_axis=0, compute_full=False, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates the negative log-likelihood loss for a target (random variable) that follows a Poisson distribution. PoissonNLLLoss measures the loss accrued from a Poisson regression prediction made by the model.

L = \text{pred} - \text{target} * \log(\text{pred}) +\log(\text{target!})




target and pred can have arbitrary shape as long as they have the same number of elements.

  • from_logits (boolean, default True) – Indicates whether the log of the predicted value has already been computed. If True, the loss is computed as \exp(\text{pred}) - \text{target} * \text{pred}, and if False, the loss is computed as \text{pred} - \text{target} * \log(\text{pred}+\text{epsilon}).

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • compute_full (boolean, default False) – Indicates whether to add an approximation (the Stirling factor) for the factorial term in the formula for the loss. The Stirling factor is: \text{target} * \log(\text{target}) - \text{target} + 0.5 * \log(2 * \pi * \text{target})

  • epsilon (float, default 1e-08) – This is to avoid calculating log(0) which is not defined.

  • pred: Predicted value

  • target: Random variable (count or number) which belongs to a Poisson distribution.

  • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as pred. For example, if pred has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • loss: Average loss (shape=(1,1)) of the loss tensor with shape (batch_size,).


forward(pred, target, sample_weight=None, epsilon=1e-08)[source]
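The interaction of from_logits, compute_full, and epsilon can be sketched per element in plain Python. The helper name poisson_nll is hypothetical. Applying the Stirling term only when target > 1 is an assumption borrowed from common implementations; it also keeps the sketch away from log(0).

```python
import math


def poisson_nll(pred, target, from_logits=True, compute_full=False,
                epsilon=1e-08):
    """Poisson negative log-likelihood for one element."""
    if from_logits:
        # pred is log(rate)
        loss = math.exp(pred) - target * pred
    else:
        # pred is the rate itself; epsilon guards against log(0)
        loss = pred - target * math.log(pred + epsilon)
    if compute_full and target > 1:
        # Stirling approximation of log(target!)
        loss += (target * math.log(target) - target
                 + 0.5 * math.log(2 * math.pi * target))
    return loss


from_log = poisson_nll(math.log(3.0), 2.0, from_logits=True)
from_rate = poisson_nll(3.0, 2.0, from_logits=False)  # nearly equal to from_log
full = poisson_nll(3.0, 5.0, from_logits=False, compute_full=True)
```

The two parameterizations differ only by the effect of epsilon, and the extra term added by compute_full is close to log(target!) for moderate counts.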

class CosineEmbeddingLoss(weight=None, batch_axis=0, margin=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

For a target label 1 or -1, vectors input1 and input2, the function computes the cosine distance between the vectors. This can be interpreted as how similar/dissimilar two input vectors are.

\begin{split}L = \sum_i \begin{cases} 1 - {cos\_sim({input1}_i, {input2}_i)} & \text{ if } {label}_i = 1\\ {cos\_sim({input1}_i, {input2}_i)} & \text{ if } {label}_i = -1 \end{cases}\\ cos\_sim(input1, input2) = \frac{{input1}_i.{input2}_i}{||{input1}_i||.||{input2}_i||}\end{split}




input1, input2 can have arbitrary shape as long as they have the same number of elements.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • margin (float) – Margin of separation between correct and incorrect pair.

  • input1: a tensor with arbitrary shape

  • input2: another tensor with the same shape as input1, to which input1 is compared for similarity and loss calculation

  • label: A 1-D tensor giving, for each pair of input1 and input2, the target label 1 or -1

  • sample_weight: element-wise weighting tensor. Must be broadcastable to the same shape as input1. For example, if input1 has shape (64, 10) and you want to weigh each sample in the batch separately, sample_weight should have shape (64, 1).

  • loss: The loss tensor with shape (batch_size,).


forward(input1, input2, label, sample_weight=None)[source]
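A single-pair sketch in plain Python is shown below. The helper names are hypothetical. How margin enters is an assumption: following the common definition, the label -1 branch is taken as max(0, cos_sim - margin), which matches the formula above when margin is 0 and the similarity is non-negative.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_embedding(a, b, label, margin=0.0):
    """label 1 pulls the vectors together, label -1 pushes them apart."""
    s = cos_sim(a, b)
    if label == 1:
        return 1.0 - s
    return max(0.0, s - margin)  # assumed handling of margin


similar_pair = cosine_embedding([1.0, 0.0], [1.0, 0.0], 1)       # 0.0
orthogonal_pair = cosine_embedding([1.0, 0.0], [0.0, 1.0], -1)   # 0.0
penalized = cosine_embedding([1.0, 0.0], [1.0, 0.0], -1, 0.5)    # 0.5
```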

class SDMLLoss(smoothing_parameter=0.3, weight=1.0, batch_axis=0, **kwargs)[source]

Bases: mxnet.gluon.loss.Loss

Calculates Batchwise Smoothed Deep Metric Learning (SDML) Loss given two input tensors and a smoothing weight. SDML Loss learns similarity between paired samples by using unpaired samples in the minibatch as potential negative examples.

The loss is described in greater detail in “Large Scale Question Paraphrase Retrieval with Smoothed Deep Metric Learning” by Daniele Bonadiman, Anjishnu Kumar, and Arpit Mittal. arXiv preprint arXiv:1905.12786 (2019).

According to the authors, this loss formulation achieves comparable or higher accuracy to Triplet Loss but converges much faster. The loss assumes that the items in both tensors in each minibatch are aligned such that x1[0] corresponds to x2[0] and all other datapoints in the minibatch are unrelated. x1 and x2 are minibatches of vectors.

  • smoothing_parameter (float) – Probability mass to be distributed over the minibatch. Must be < 1.0.

  • weight (float or None) – Global scalar weight for loss.

  • batch_axis (int, default 0) – The axis that represents mini-batch.

  • Inputs

    • x1: Minibatch of data points with shape (batch_size, vector_dim)

    • x2: Minibatch of data points with shape (batch_size, vector_dim). Each item in x2 is a positive sample for the same index in x1. That is, x1[0] and x2[0] form a positive pair, x1[1] and x2[1] form a positive pair, and so on. All data points in different rows should be decorrelated.

  • Outputs

    • loss: loss tensor with shape (batch_size,).





forward(x1, x2)[source]

The function computes the KL divergence between the negative distances (internally converted to probabilities with a softmax) and the identity matrix.

This assumes that the two batches are aligned, so the most similar vector should be the one with the same index.

Batch1                 Batch2

President of France    French President
President of US        American President

Given the question "President of France" in batch 1, the model will learn to predict "French President" by comparing it with all the other vectors in batch 2.
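The procedure described here can be sketched in plain Python. The helper name sdml_loss is hypothetical, and several details are assumptions: squared Euclidean distance between rows, a row-wise softmax over the negative distances, and a target that puts 1 - smoothing_parameter on the aligned index and spreads the rest evenly over the other rows.

```python
import math


def sdml_loss(x1, x2, smoothing_parameter=0.3):
    """Return one loss per row of x1 (batch size must be at least 2)."""
    n = len(x1)
    if n < 2:
        raise ValueError("need at least two rows to form negatives")
    losses = []
    for i in range(n):
        # Negative squared distance from x1[i] to every row of x2.
        neg_d = [-sum((a - b) ** 2 for a, b in zip(x1[i], x2[j]))
                 for j in range(n)]
        # Log-softmax over the row, computed stably.
        m = max(neg_d)
        log_z = m + math.log(sum(math.exp(v - m) for v in neg_d))
        log_p = [v - log_z for v in neg_d]
        # Smoothed identity row: most mass on the aligned index.
        target = [1.0 - smoothing_parameter if j == i
                  else smoothing_parameter / (n - 1) for j in range(n)]
        # KL divergence between the target and the predicted distribution.
        losses.append(sum(t * (math.log(t) - lp)
                          for t, lp in zip(target, log_p) if t > 0))
    return losses


x1 = [[0.0, 0.0], [5.0, 5.0]]
aligned = sdml_loss(x1, [[0.1, 0.0], [5.0, 5.1]])
swapped = sdml_loss(x1, [[5.0, 5.1], [0.1, 0.0]])  # pairs misaligned
```

In this toy batch the aligned pairing yields a lower total loss than the swapped one, which is the behavior the alignment assumption relies on.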

