Technical

19 Jul 2023 | 69 min read

Cross-Entropy is All You Need… Or is It?

Author

Chady Dimachkie

Machine Learning Engineer

In the rapidly evolving field of machine learning, deep neural networks have proven to be powerful in solving complex tasks across various domains. However, the impressive performance of these models comes at a price — the need for large-scale, accurately annotated datasets. Unfortunately, these datasets are often prone to noisy labels, which can significantly hinder the performance of trained models.

Inspired by previous research, we present a loss function called “Smooth Generalized Cross-Entropy”. This loss function not only addresses the issue of noisy labels in training but also enhances model confidence calibration and helps with training regularization.

The primary objective of the proposed loss function is to improve the training of neural network models. We'll demonstrate this on the task of Named Entity Recognition (NER), but you can apply the same principle to any other classification task. By combining two theoretically grounded functions, the Cross-Entropy (CE) and the Mean Absolute Error (MAE), we introduce a flexible method for noise-robust training. On top of this, we apply label smoothing to calibrate the model, a further form of regularization that is also crucial for reliable uncertainty estimation. Our loss function can be seamlessly integrated into existing architectures, offering a practical and effective solution for a wide range of noisy-data scenarios.

In this article, we will delve into the details of the Smooth Generalized Cross-Entropy (SGCE) loss function, outlining its theoretical foundations by explaining why it is needed and how it works. To do this, we will be training a distilbert-based model on the public dataset CoNLL-2003. You’ll find the full code in the Colab notebook here.

Let’s get started!


A bit of context

Training a model requires collecting a lot of raw data from many sources. Part of it can be generated data, or even related data that isn't precisely from your domain but can give the model more general knowledge so it generalizes better. Getting the data labeled requires maintaining a labeling pipeline where people need to be trained for the task at hand; some annotators are better than others, which leads to mistakes or quality imbalance in the data and hence added noise. Similarly, labeled data can also come from pseudo-labeling, where you use your own models (which can add bias) or external ones (foundation models, models fine-tuned on similar enough data, etc.) and derive a final consensus to reduce the overall amount of noise.

Even with lots of precautions on your data setup, you will encounter noise that will prevent you from reaching the target accuracy you are aiming for.

At Ntropy, we’ve had the above issues, where collecting transactional data from various sources leads to the model learning different biases. Transactional data can be hard to label since it involves finding ambiguous entities, and many times there can be several valid answers, which adds noise if the labeling source doesn’t dig into all the possibilities.

Now, what could be something else that can contribute to amplifying bias and/or noise? The loss function itself!

The Standard: Cross-Entropy (CE) Loss

The go-to loss when training a Transformer model nowadays is Cross-Entropy (CE). Look at the HuggingFace Trainer’s code: it’s the default and only option for classification tasks, and it’s what models return by default when you run their forward function. No other loss is available out of the box for this task.

Fortunately, you can still override the compute_loss function of the Trainer and implement your own loss, which we’ll do later in this article.
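As a preview, here is a minimal sketch of what such an override could look like (the class name and the way the loss is passed in are illustrative, not the exact notebook code), assuming a token-classification model whose outputs expose logits:

import torch
from transformers import Trainer

class CustomLossTrainer(Trainer):
    """Minimal sketch: swap the Trainer's default Cross-Entropy for a custom loss module."""

    def __init__(self, *args, loss_fn=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_fn = loss_fn  # e.g. an instance of the GCELoss defined later

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Flatten (batch, seq_len, num_labels) -> (batch * seq_len, num_labels) for token classification
        loss = self.loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss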

Now, what does cross-entropy do? It measures the dissimilarity between the model’s predicted probabilities and the true distribution of the labels, which tells you where errors happen in the model’s predictions. Cross-Entropy is simply written as the following:

Cross-Entropy (CE) loss
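For reference, written per sample, with $f_{y_i}(x_i;\theta)$ denoting the softmax probability the model assigns to the true label $y_i$ of sample $x_i$ (this notation follows the GCE paper discussed later), the standard form is:

$$\mathcal{L}_{CE}(\theta) = -\log f_{y_i}(x_i;\theta)$$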

Now, let’s look at its derivative (with respect to θ) which is the following:

Derivative of the Cross-Entropy (CE) loss with respect to θ
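In the same notation, the gradient works out to:

$$\nabla_\theta \mathcal{L}_{CE} = -\frac{1}{f_{y_i}(x_i;\theta)}\,\nabla_\theta f_{y_i}(x_i;\theta)$$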

All is good until you notice this one term:

Weighting term in the derivative of the Cross-Entropy (CE) loss
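That term is the factor multiplying the gradient:

$$\frac{1}{f_{y_i}(x_i;\theta)}$$

The lower the probability the model currently assigns to a sample’s true label, the larger that sample’s contribution to the parameter update.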

If you’ve been following what is being said regarding noise at the beginning, this term should come as quite the shocker — it is a weighting term!

What this weighting term does is treat samples from our dataset differently. That’s a good thing when your model does great on average or easy samples but struggles to converge on harder ones: the loss will focus more on the hardest examples.

All is good until you remember that no production dataset is perfectly unbiased or noise-free. And what are noisy examples to our model? Hard examples, because they can go against the true distribution you’re trying to learn. Meaning we don’t want the loss to focus on the noise and try to make the model learn it at all costs!

Ok, now that you’re scared for your data, what do we do? 😱

Let’s try the opposite approach, one that preaches for a utopia of equal treatment for our training samples, and see what happens!

An Equality Utopia: Mean Absolute Error (MAE) Loss

How do we make a loss that will treat our samples equally? Look no further than the Mean Absolute Error (MAE) loss!

Let’s see what it looks like:

The Mean Absolute Error (MAE) loss
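Written in the same per-sample notation, with $e_{y_i}$ the one-hot vector of the true label:

$$\mathcal{L}_{MAE}(\theta) = \lVert e_{y_i} - f(x_i;\theta) \rVert_1 = 2 - 2\,f_{y_i}(x_i;\theta)$$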

Note: MAE is equivalent to the Unhinged Loss up to a constant of proportionality. So we’ll use the Unhinged Loss instead for simplicity, since they achieve the same thing, but we’ll keep calling it Mean Absolute Error (MAE).

What this loss does is quite simple: it computes the absolute distance between your model’s predicted probabilities and the one-hot ground truth.

Let’s look at its derivative:

Derivative of the Unhinged/MAE Loss
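Using the Unhinged form $1 - f_{y_i}(x_i;\theta)$, this is simply:

$$\nabla_\theta \mathcal{L}_{MAE} = -\,\nabla_\theta f_{y_i}(x_i;\theta)$$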

What we notice right away is the missing weighting term!

Both loss functions’ derivatives

Again, we notice the simplicity: the derivative of the MAE loss just points in the direction of the model’s success or mistake, without any weighting that could favor one sample over another.

This means that this loss is, in theory, noise-robust since it’ll be much less affected by noise than the CE loss!


The Hard Truth

The MAE loss seems to solve the issue in the context of noise, but the reality is quite different:

Indeed, the MAE loss actually performs much worse than the CE loss in most contexts, including noisy scenarios.

That’s a bummer.

But think about it, an important question to ask is: What are the assumptions that you take when you want to train on a dataset?

The CE loss makes the ideal assumption that the data is perfect.

Perfect means:

  • In theory: Your data is free from the bias and noise that could be detrimental to your given task (other biases could exist that don’t harm the specific task you’re trying to solve of course)
  • In practice: You applied as much data cleaning as possible so that it’s a clean enough dataset. Clean enough always remains to be defined 😀

And in reality, this is a good idea. The model will learn the easy samples quickly, giving it a foundational understanding of your task, and then it’ll struggle on the hard samples and the loss will focus on learning these later.

As for the MAE loss, it makes no assumption about where your model should focus: it tries to learn all samples at the same time, usually taking much longer to converge, if it ever does. Imagine trying to learn a new topic while focusing on advanced exercises as much as beginner-level ones. You’ll confuse yourself and take much longer to learn, and you’ll notice you need a basic understanding of the topic anyway before tackling the advanced material; there is an order in which to learn new things. The same idea roughly applies to your model.

A Compromise: Generalized Cross-Entropy (GCE) Loss

Both of the above loss functions have pros and cons that we can use in different contexts. So let’s just combine them!

Based on the idea from the paper “Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels”, we can define a loss function that merges both the above functions into one.

The Generalized Cross-Entropy (GCE) loss is defined as follows:

Generalized Cross-Entropy (GCE) Loss with its q parameter
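In the notation used throughout, with $q \in (0, 1]$:

$$\mathcal{L}_{q}(\theta) = \frac{1 - f_{y_i}(x_i;\theta)^{q}}{q}$$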

Well, technically this function comes from page 214 of “An Analysis of Transformations” by G. E. P. Box and D. R. Cox (did The Long-Lost Brothers finally find each other outside of the Opera too? 😀) in the Journal of the Royal Statistical Society, Series B (Methodological), Vol. 26, No. 2 (1964). It is actually the negation of the Box-Cox transformation.

And now its derivative, as much as we have been enjoying looking at them so far 😀:

Derivative of the Generalized Cross-Entropy (GCE) Loss with its Tunable Weighting term!
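The gradient now carries a tunable weighting term $f_{y_i}(x_i;\theta)^{q-1}$:

$$\nabla_\theta \mathcal{L}_{q} = -\,f_{y_i}(x_i;\theta)^{q-1}\,\nabla_\theta f_{y_i}(x_i;\theta)$$

For $q \to 0$ this weighting term tends to $1 / f_{y_i}(x_i;\theta)$, the CE weighting, and for $q = 1$ it is constant, as in MAE.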

We can clearly see that the proposed loss function is equivalent to CE for q → 0, and becomes the MAE loss when q → 1.
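To see the $q \to 0$ limit explicitly, write $p = f_{y_i}(x_i;\theta)$ and use $p^{q} = e^{q \ln p} \approx 1 + q \ln p$ for small $q$:

$$\lim_{q \to 0} \frac{1 - p^{q}}{q} = -\ln p \quad \text{(the CE loss)}, \qquad \frac{1 - p^{q}}{q}\bigg|_{q=1} = 1 - p \quad \text{(the Unhinged/MAE loss, up to constants)}$$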

This is indeed a generalization of both the CE and MAE loss since it combines both their properties in a weighted fashion.

You’ll notice the q parameter, which effectively asks: how much of the CE’s discrimination do you want versus how much of the MAE’s noise-robustness?

Or in other words: What is, roughly, the estimated noise level of your current dataset?

This is quite a good compromise, since we can afford a slightly longer convergence when there is a lot of noise, and it is all tunable!


Let’s implement it

We’re going to test this loss in practice by implementing it and trying it out on the public dataset CoNLL-2003. You can also try it out on your own dataset.

A possible implementation in PyTorch could be the following (You’ll find the full notebook to try it out yourself, here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCELoss(nn.Module):
    def __init__(self, q=0.5, ignore_index=-100):
        super(GCELoss, self).__init__()
        self.q = q
        self.ignore_index = ignore_index
        self.eps = 1e-8

    def forward(self, logits, targets):
        valid_idx = targets != self.ignore_index
        logits = logits[valid_idx] + self.eps
        targets = targets[valid_idx]
        # Optimization: use PyTorch's Cross-Entropy when q = 0 (much faster)
        if self.q == 0:
            # Single-class case
            if logits.size(-1) == 1:
                ce_loss = nn.BCEWithLogitsLoss(reduction='none')
                loss = ce_loss(logits.view(-1), targets.float())
            else:
                # Multi-class case
                ce_loss = nn.CrossEntropyLoss(ignore_index=self.ignore_index, reduction='none')
                loss = ce_loss(logits, targets)
            loss = (loss.view(-1) + self.eps).mean()
        else:
            # The actual Generalized Cross-Entropy Loss in case q > 0
            # Handle the same cases as above
            if logits.size(-1) == 1:
                pred = torch.sigmoid(logits)
                pred = torch.cat((1 - pred, pred), dim=-1)
            else:
                pred = F.softmax(logits, dim=-1)
            # The numerator part of the GCE loss
            numerator = (pred ** self.q)
            # Make the numerator more numerically stable (this is not in the original loss)
            pred = (numerator + self.eps).where(pred >= 0, -((pred.abs() ** self.q) + self.eps))
            # The denominator part of the GCE loss
            loss = (1. - pred) / (self.q + self.eps)
            # Mean reduction to collapse the loss into a single number
            loss = (loss + self.eps).mean()
        return loss

The GCELoss class inherits from the nn.Module class, which is a base class for all PyTorch modules. The class takes two parameters during initialization:

  • q: Controls the estimated noise level
  • ignore_index: Defaults to -100 and ignores the contributions of special tokens like padding, etc

We then implement the loss, incorporating an optimization for the q = 0 case where we fall back to PyTorch’s existing, optimized Cross-Entropy implementation. We also make the loss more numerically stable in some cases where it could otherwise output NaN.

Training a model

All the code is available here so you can reproduce all the experiments.

The model is trained in several contexts, from no added noise to all-you-can-eat noise, where uniform label noise is injected at a level given by the n parameter, and we try several values for the q parameter in all those contexts.
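As an aside, injecting uniform label noise at level n can be done along these lines (a sketch of the idea; the exact code in the notebook may differ):

import random

def add_uniform_label_noise(labels, noise_level, num_labels, ignore_index=-100, seed=0):
    """Flip a fraction `noise_level` of the valid token labels to a uniformly random different label."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if label != ignore_index and rng.random() < noise_level:
            # Pick any label except the original one
            noisy.append(rng.choice([l for l in range(num_labels) if l != label]))
        else:
            noisy.append(label)
    return noisy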

Generalized Cross-Entropy (GCE) Training Loss for the loss parameter q ∈ [0.2, 0.4, 0.6, 0.8] and the data noise level n ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

Note: The seed is the same for every run, so the loss will have a similar behavior and shape across different q values, which enables better comparison between runs.

From what we can see, the loss falls very quickly and then steadily goes down when not much noise is involved. Furthermore, the noisier the conditions get, the slower the loss goes down.

Now let’s add the Cross-Entropy loss to the graph and compare:

GCE, CE and MAE Training Losses for the loss parameter q ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] and the data noise level n ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

The scale of both losses is quite different and we can see that the CE loss has a bigger dynamic range. This might actually be better in terms of numerical stability, but we haven’t noticed any issues regarding this while training. In the end, the difference is only in scale since the behavior and shape of the loss remain the same.

In the future, it would be interesting to experiment with how this loss scaling difference shifts the distribution of gradients, especially since we are dealing with floating points that are prone to numerical instability, and using their dynamic range effectively is essential to avoid overflows (and underflows).

Testing the model

Now, you might wonder:

But, did we just forget all about the MAE (q=1.0) loss in this comparison?

Well, not really, but the models trained with this loss never ended up converging in any of the scenarios we tried… oops…

The reality is that this loss takes much longer to converge than any other loss that introduces even the slightest bias, meaning q < 1.0.

Still, let’s look at the final test set F1 scores (including MAE’s results 😉) across the whole range of values we have tested so far:

Test set F1 scores for loss parameter q ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] and data noise level n ∈ [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

We can now start drawing some conclusions about the usage of the q parameter of this loss. It seems that when our dataset is clean and doesn’t contain any untrusted sources, we can just rely on the good old Cross-Entropy loss, which remains strong. However, the GCE loss still matches it very closely for q values up to and including 0.6.

Things change quite drastically under noisy conditions, where the CE loss doesn’t perform as well and we start needing bigger values of q to deal with the noise. Interestingly, if you have statistically significant noise, jumping directly to bigger values of q seems to work best, meaning q isn’t exactly linearly proportional to the noise amount n as we would have expected. These results might differ under different noise conditions, so it is a good idea to optimize the q parameter while training. Finally, whenever the GCE loss approximates the MAE loss, anything it produces does not converge in a reasonable time.

Overall, starting from 20% noise, we start seeing some gains, ranging from 0.5 points to as much as 6+ points under the heaviest noise scenarios, without any change other than using the GCE loss.


The Hard Truth, Again

The Generalized Cross-Entropy offers quite a nice bump in accuracy under noisy conditions, which is representative of most real-world scenarios, and all we need to do is swap the CE loss for the GCE loss. Sounds very convenient.

So what could we be unhappy about this time?

Model confidence calibration!

Model confidence calibration is unfortunately something that isn’t mentioned often, but it is actually very important, especially in production environments where end users expect to make decisions based on your model’s predictions. Even the original Generalized Cross-Entropy paper never mentions calibration! Quite often, deep neural networks hallucinate, and sometimes the output looks believable!

Model calibration is something that isn’t often mentioned unfortunately but is actually very important!

Knowing this, we’re going to explore how (un)calibrated our trained models are using reliability plots on the CoNLL-2003 dataset.

Let’s first look at reliability plots since they are the easiest to understand. If the model outputs a confidence of 0.5, then its accuracy on those predictions should be 0.5. If it outputs 1.0, then those should be perfect answers from the model, with an accuracy of 1.0.

The plot below shows the model trained with n=0.6 noise, so a heavy noise scenario:

Cross-Entropy loss reliability plots (10 bins) for both test and validation sets under the n=0.6 noise condition

This graph should show confidence equal to accuracy at every level (but it doesn’t in this case).

Let’s explain it a bit more deeply. Here you can see 10 red bars marked as “Gap”, 10 (or fewer) blue bars marked as “Output”, and a diagonal gray line:

  • The blue bars show the model’s actual accuracy at each confidence level
  • The red bars represent what the model is missing compared to an ideally calibrated model
  • The diagonal gray line represents where all the blue bars should reach, at each confidence mark, for a perfectly calibrated model

Let’s take another example to make this as clear as possible. If we look at the confidence axis at the blue bar on the 0.5 mark, it shows that almost every time the model outputs an answer with a confidence of 0.5, it is actually a perfect answer with an accuracy of 1.0; the same holds from 0.3 confidence up to 0.7. After this, we only see red bars, which means the model never actually outputs anything with a confidence above 0.7, even when the answer it outputs is fully correct.

This is very bad! It means the model actually scores very well, but in a production environment we can’t be sure of anything if it keeps reporting a seemingly random confidence between 0.3 and 0.7. It’s basically asking the model “Are you sure of your output?” and having it reply “Meh 🤷” most of the time, even when the answer is correct. The model lacks confidence in itself!

Now that the problem is clear, let’s explore it even further.

Looking back at the above graph, we can also compute the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE) metrics. These measure, respectively, the average and worst-case calibration error, i.e. the difference between accuracy and confidence across the bins.

For the test set, we have an ECE of 0.56 and an MCE of 0.60. The lower, the better, so we can still do better.
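If you want to compute these yourself, a simple binned implementation (a sketch using the usual 10-bin definition, not the exact notebook code) looks like this:

import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Expected (ECE) and Maximum (MCE) Calibration Error from per-prediction confidences
    and a boolean array indicating whether each prediction was correct."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Gap between accuracy and average confidence within the bin
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # bin weight = fraction of predictions in the bin
        mce = max(mce, gap)
    return ece, mce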


Can Generalized Cross-Entropy help here?

From what we looked at, the Generalized Cross-Entropy loss is able to regularize training much better under noisy conditions, so in theory, it could provide more stable logits when training a model. Let’s look at the reliability plots to confirm this:

Reliability plots (10 bins) for both test and validation sets for the Generalized Cross-Entropy loss under the n = 0.6 noise conditions and loss parameter q ∈ [0.6, 0.8]

The plots look much better. Although not perfect, we can see that the model is now actually aiming for the gray diagonal line of calibration, and its confidence makes much more sense compared to the Cross-Entropy loss.

Similarly, let’s look at the ECE and MCE values:

  • For test, n=0.6 and q=0.6: ECE is 0.11 and MCE is 0.21
  • For test, n=0.6 and q=0.8: ECE is 0.023 and MCE is 0.30
  • For validation, n=0.6 and q=0.6: ECE is 0.11 and MCE is 0.25
  • For validation, n=0.6 and q=0.8: ECE is 0.013 and MCE is 0.24

Clearly, the model is much better calibrated using the Generalized Cross-Entropy loss. The expected error is reduced by 24x and the maximum error by a factor of 2–3x. We also notice that a higher q value seems to help calibration for both test and validation, meaning the MAE properties seem to regularize the logits during training, as we expected.

Extending the loss: Smooth Generalized Cross-Entropy (S-GCE)

The above plots and metrics look significantly better, but why stop here?

Let’s extend the Generalized Cross-Entropy loss with another regularization mechanism that should help it: label smoothing!

Label Smoothing

Smoothing the labels is the act of softening the targets you want your model to match. So, label smoothing prevents the neural network from becoming over-confident when it makes a prediction. But why is that?

Take a look at this beautiful example drawing depicting a neural network making a prediction on transactional data:

Beautiful drawing depicting a neural network making a prediction on a Named Entity Recognition (NER) task

A bank transaction can contain several entities that could be an organization, a location, and more.

Here, each bar and its color represent the probability currently output by the model and the label of its current prediction. The model is trained to maximize the probability of the correct label of each token, so it will naturally assign a high probability to the labels it thinks best match the current sentence.

An issue with this is that the model will be trained to be overconfident, since that is the training objective. You’d need a near-perfect dataset, balanced and well labeled, for the model to learn that an ambiguous example is, say, one label 50% of the time and another the rest, which would balance the probabilities and give a better-calibrated model. But this is often not the case, especially in large-scale datasets, and anything that disrupts this balance will cause the model to deviate from the calibration we want.

Now let’s imagine we could spread out the probabilities like this:

2 cases of probabilities being spread out with a different smoothness slope

Spreading out the probabilities will actually smooth them. This means that, during training, instead of always trying to predict the maximum probability for the most likely final label, we’ll allow some uncertainty that will increase the probabilities for every other label by a little bit, keeping a high probability for the most likely one, but not a 100% probability anymore.
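Concretely, with a smoothing factor $s$ and $K$ possible labels, one common formulation (essentially what the S-GCE implementation below uses) replaces the one-hot target with:

$$y_{k}^{\text{smooth}} = (1 - s)\,\mathbb{1}[k = y] + \frac{s}{K - 1}, \qquad k = 1, \dots, K$$

so the true label keeps probability $1 - s$ and the remaining mass $s$ is spread evenly over the other labels.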

Also, in the case of Named Entity Recognition (NER), we can smooth along spans as demonstrated above since sometimes even the labeled data we have isn’t sure when an entity starts or stops. So to help with this, we allow the probabilities to spread and contaminate the tokens directly around it. It’d be possible to spread across multiple tokens, but this seems only useful in domains where the data is very ambiguous.

Implementation

Here’s the fully commented code to extend the loss with the above approach (you’ll find the whole notebook to try it out yourself, here):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SGCELoss(nn.Module):
    def __init__(self, q=0.5, smoothing=0.1, ignore_index=-100):
        super(SGCELoss, self).__init__()
        self.q = q
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.eps = 1e-8

    def forward(self, logits, targets):
        valid_idx = targets != self.ignore_index
        logits = logits[valid_idx] + self.eps
        targets = targets[valid_idx]
        # Optimization: use PyTorch's Cross-Entropy when q = 0 (much faster)
        if self.q == 0:
            # Single-class case
            if logits.size(-1) == 1:
                ce_loss = nn.BCEWithLogitsLoss(reduction='none')
                loss = ce_loss(logits.view(-1), targets.float())
            else:
                # Multi-class case
                # PyTorch implements label smoothing for the CE loss by default
                ce_loss = nn.CrossEntropyLoss(ignore_index=self.ignore_index, reduction='none',
                                              label_smoothing=self.smoothing)
                loss = ce_loss(logits, targets)
            loss = (loss.view(-1) + self.eps).mean()
        else:
            # The actual Smooth Generalized Cross-Entropy Loss in case q > 0
            # Handle the same cases as above
            if logits.size(-1) == 1:
                pred = torch.sigmoid(logits)
                pred = torch.cat((1 - pred, pred), dim=-1)
            else:
                pred = F.softmax(logits, dim=-1)
            # The numerator part of the GCE loss
            numerator = (pred ** self.q)
            # Make the numerator more numerically stable (this is not in the original loss)
            pred = (numerator + self.eps).where(pred >= 0, -((pred.abs() ** self.q) + self.eps))
            # The denominator part of the GCE loss
            loss = (1. - pred) / (self.q + self.eps)
            # Here we construct the array we want to write the smoothing into
            pre_smoothed = pred.new_ones(pred.size()) * self.smoothing / (pred.size(-1) - 1.)
            # Given the targets' positions, we write the smoothing factors in the correct spot
            # (inverted because if we have smoothing=0.1, then the highest probability will be 0.9 and
            # the remaining 0.1 will be scattered)
            pre_smoothed = pre_smoothed.scatter_(-1, targets.unsqueeze(-1), (1.0 - self.smoothing))
            # Apply the smoothing
            smoothed = pre_smoothed * loss
            # Mean reduction to collapse the loss into a single number
            loss = (smoothed + self.eps).mean()
        return loss

Again, we reuse the PyTorch default implementation for the CE loss and its label smoothing parameter.

For the actual Smooth Generalized Cross-Entropy (S-GCE) loss part, we have to implement it manually. We first construct the pre_smoothed array, which will contain the smoothing transformation. We take the targets array, which contains the indices of the ground-truth labels, and use those indices to write the smoothing factor 1.0 - smoothing into the right positions. We invert the smoothing parameter because, if we have smoothing=0.1 for example, then we want the highest probability, 1.0, to become 0.9, which is 1.0 - 0.1. Finally, we apply this array by multiplying it element-wise with the loss.

Training more models

Below, you’ll see the Expected Calibration Error (ECE) plotted against the Data Noise level (n) for several fine-tuned distilbert-based models:

Expected Calibration Error (ECE) plotted against Data Noise level (n) on several trained models with q ∈ [0.0, 0.6, 0.8], n ∈ [0.0, 0.2, 0.6], smoothing factor s ∈ [0.0, 0.05, 0.1, 0.15, 0.2]. Left: Test set, Right: Validation set. Lower ECE is better.

The plots definitely show that under no noise, the CE loss is still very good at calibrating a model if the dataset is well made. Unfortunately, the metrics for the CE loss degrade very quickly on average in noisy scenarios. This means the loss is very sensitive to noise, as explained earlier: it focuses on learning some of those bad samples, which can confuse the model and create an imbalance in the model’s confidence distribution even if the accuracy is very high.

Another observation is that even when using smoothing with the CE loss, we are not able to reduce the calibration error.

The average case is important, but minimizing the worst-case scenario is also important, so let’s look at the Maximum Calibration Error (MCE):

Maximum Calibration Error (MCE) plotted against Data Noise level (n) on several trained models with q ∈ [0.0, 0.6, 0.8], n ∈ [0.0, 0.2, 0.6], smoothing factor s ∈ [0.0, 0.05, 0.1, 0.15, 0.2]. Left: Test set, Right: Validation set. Lower MCE is better.

Interestingly, the MCE behaves quite differently, but it’s not that unexpected since it measures the maximum difference between average confidence and accuracy across bins. Nonetheless, we can still see that the more noise we have in the data, the less stable the CE loss gets at providing good confidence. The CE loss matches the SGCE loss until a reasonable amount of noise, then it is completely outmatched. The SGCE loss is actually quite stable across different noise levels, except for an outlier.

While ECE and MCE aggregate bin statistics into a single scalar metric, which helps us visualize lots of different trainings, reliability diagrams let the statistics of each bin be viewed at once. So let’s look at a few of them, on the test set only to avoid clutter:

SGCE and CE test set reliability plots with 10 bins with n = 0.6 noise

In a highly noisy context, the SGCE loss is much better calibrated and gets better with a smoothing factor around 0.1 < s < 0.2, while the CE loss doesn’t really get better since it’s already struggling with learning the noise in the first place.

Now, if there is no noise, things are different:

The CE loss is better implicitly calibrated since it learns the correctly balanced data better and doesn’t try to attenuate nonexistent noise. The SGCE loss needs a low q value to perform similarly, which basically just approximates the CE loss.

So as we can see, the SGCE loss is much better in both accuracy and calibration metrics in almost all scenarios. And it can become the CE loss anyway with q=0, so it is indeed a generalization of the CE loss.

The Intuition

Why does smoothing work? The intuition is that smoothing also tries to give every label a more equal chance of appearing in the prediction, which is more relevant the more mislabeled data we have:

This is a similar principle to what the MAE property achieves at the sample level, but here we do it at the token level!

Of course, training will collapse if you increase s to a high enough value, since we then approach the same target value for each label, which is equivalent to predicting every label with the same probability.


A Note on Model Knowledge Distillation

If your goal is to train a teacher model that will then get distilled into a student model, you should be careful with smoothing. Indeed, the distillation process will be much less effective: it has been observed that smoothing encourages the embedded representations of similar samples to group into tight clusters.

Visualization of the penultimate layer’s activations of CIFAR-100/ResNet-56 (Source: When Does Label Smoothing Help?)

As observed, this is probably caused by the fact that models trained with and without smoothing produce embeddings at different scales (as we can see in the graphs). Without smoothing, the representations have much higher absolute values, which can translate into over-confident predictions, while models trained with smoothing embed semantically similar samples in tighter clusters, meaning they’ll eventually get mapped to similar logits.

Another interesting thing their paper shows is that there is much less continuity between classes, meaning you can lose intermediate meanings for some classes. For example, “blue” could be one cluster and “glasses” another; before smoothing you could easily find “blue glasses” in between, but afterwards this information can get lost or absorbed into one or both clusters, meaning it has become discrete.

Furthermore, the correlation between compression of layer representations and generalization has been shown in some papers, which may explain why networks trained with label smoothing still generalize so well. Another paper also shows this relates to the information bottleneck theory that explains generalization in terms of compression.

Up to you to dive deeper and see what could be the best explanation 😁

Personally, I think this problem can also be explained by our current methods for distillation, which usually use the zero-avoiding forward Kullback–Leibler (KL) divergence (great article if you don’t know the difference) that will try to spread the student’s approximating distribution Q(x) across the teacher’s clusters.
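For reference, the two directions between the teacher distribution P and the student’s approximation Q are:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad \text{(forward: zero-avoiding, } Q \text{ must put mass wherever } P \text{ does)}$$

$$D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x)} \qquad \text{(reverse: zero-forcing, } Q \text{ concentrates on the main modes of } P\text{)}$$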

Quick graph to show a Student model Q(x) trying to approximate Teacher model P(x) using the forward KL divergence

In our case, if smoothing tightens the clusters, then Q(x) will be very spread out and won’t closely approximate the teacher model’s predictions. It will instead approximate what would have been the continuous representation of a model trained without smoothing, since, if you remember, we lost the continuity between classes 😀

We didn’t dive into distillation after the SGCE loss training, but I’d suppose that using the reverse KL divergence could help alleviate the problem and fit something that works:

Quick graph to show a Student model Q(x) trying to approximate Teacher model P(x) using the reverse KL divergence

It seems like a better compromise to fit the bulk of a distribution well rather than to badly fit something different that kind of covers everything. It should enable the student model to capture the most relevant information and then learn the bigger picture of the data.

Another idea could be to alternate between fitting a model with the forward and reverse KL divergences to mix their advantages (similarly to how we mixed the CE and MAE properties), an idea we’ve been thinking about for a while now, even if it seems quite computationally intensive.

This ended up being a long note on distillation but if anyone has a better explanation, please feel free to let me know 😄


Conclusion

At Ntropy, we have been doing our best to control the entire training pipeline for our models, from data collection to cleaning to training. Everything should be taken into account to achieve the best performance possible. Transactional data can be very ambiguous, so quality control is very important so that our model learns from the best data it can. This is why we have to do fundamental research on every part of our pipeline, including even the unsuspected Cross-Entropy loss function 🧐

Now, let’s summarize everything from this post:

  • The Generalized Cross-Entropy (GCE) loss is a generalization of both the Cross-Entropy (CE) loss and the Mean Absolute Error (MAE)
  • The q parameter is typically used between 0 and 1, and a rough value can be chosen by answering this question: “How much noise suppression do I want during training?”
  • Confidence calibration is very important for real-world production scenarios
  • The Smooth Generalized Cross-Entropy (SGCE) is an extension of the loss that enables even better calibration, without many added tricks (like temperature scaling), using the smoothing parameter s between 0 and 1
  • Smoothing applies a similar principle to the MAE property, but at the token level instead of the sample level
  • The Cross-Entropy (CE) loss, which is equivalent to the Smooth Generalized Cross-Entropy (SGCE) with q=0, is still a good choice for clean datasets
  • The Mean Absolute Error (MAE) loss is a bad choice for most datasets
  • The code is available here
  • Distillation after smoothing is still a research topic 😀

If you enjoyed reading this article, don’t hesitate to subscribe and post comments if you have any questions!
