On Calibration of Modern Neural Networks
Abstract
Modern neural networks often output confidence scores that are too high relative to their true correctness rate. The paper shows this systematically, then demonstrates that a one-parameter post-hoc correction, temperature scaling, is often enough to repair calibration without retraining the model.
Key Takeaways
- Modern deep networks can achieve high accuracy while still being overconfident, so calibration and accuracy are separate properties.
- Common modern training choices often improve accuracy while making calibration worse.
- Simple temperature scaling provides a strong post-hoc fix for many real classification systems.
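A minimal sketch of the temperature-scaling idea named in the last takeaway, assuming the standard form in which the logits are divided by a single scalar T > 0 before the softmax and T is chosen to minimize NLL on held-out validation data. The function names, the search interval, and the use of golden-section search are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log likelihood of the true labels at this temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(logit_rows, labels, low=0.05, high=20.0, iters=80):
    """Golden-section search over log T for the temperature minimizing NLL.

    Assumption: the minimizer lies in [low, high]; these bounds are
    hypothetical defaults chosen for illustration only.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(low), math.log(high)
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        nll_c = mean_nll(logit_rows, labels, math.exp(c))
        nll_d = mean_nll(logit_rows, labels, math.exp(d))
        if nll_c < nll_d:
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)
```

For example, with validation logits `[[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]` and labels `[0, 0, 1, 1]`, three of four predictions are correct but each carries about 98% confidence. The fitted T is 4 / ln 3 (about 3.64), which lowers the confidence to 0.75 while leaving every predicted class unchanged, since dividing by a positive scalar preserves the argmax.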
Full Text
On Calibration of Modern Neural Networks
Chuan Guo*¹, Geoff Pleiss*¹, Yu Sun*¹, Kilian Q. Weinberger¹
* Equal contribution, alphabetical order. ¹ Cornell University. Correspondence to: Chuan Guo <cg563@cornell.edu>, Geoff Pleiss <geoff@cs.cornell.edu>, Yu Sun <ys646@cornell.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling – a single-parameter variant of Platt Scaling – is surprisingly effective at calibrating predictions.

[Figure 1 image omitted. Panels: LeNet (1998) on CIFAR-100 (left) and ResNet (2016) on CIFAR-100 (right). Top row: % of Samples vs. Confidence, with Accuracy and Avg. confidence marked. Bottom row: Accuracy vs. Confidence, showing Outputs and Gap. Panel annotations: Error=44.9 (left), Error=30.6 (right).]
Figure 1. Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100. Refer to the text below for detailed illustration.

1. Introduction

Recent advances in deep learning have dramatically improved neural network accuracy (Simonyan & Zisserman, 2015; Srivastava et al., 2015; He et al., 2016; Huang et al., 2016; 2017). As a result, neural networks are now entrusted with making complex decisions in applications, such as object detection (Girshick, 2015), speech recognition (Hannun et al., 2014), and medical diagnosis (Caruana et al., 2015). In these settings, neural networks are an essential component of larger decision making pipelines.

In real-world decision making systems, classification networks must not only be accurate, but also should indicate when they are likely to be incorrect. As an example, consider a self-driving car that uses a neural network to detect pedestrians and other obstructions (Bojarski et al., 2016). If the detection network is not able to confidently predict the presence or absence of immediate obstructions, the car should rely more on the output of other sensors for braking. Alternatively, in automated health care, control should be passed on to human doctors when the confidence of a disease diagnosis network is low (Jiang et al., 2012). Specifically, a network should provide a calibrated confidence measure in addition to its prediction. In other words, the probability associated with the predicted class label should reflect its ground truth correctness likelihood.

Calibrated confidence estimates are also important for model interpretability. Humans have a natural cognitive intuition for probabilities (Cosmides & Tooby, 1996). Good confidence estimates provide a valuable extra bit of information to establish trustworthiness with the user – especially for neural networks, whose classification decisions are often difficult to interpret. Further, good probability estimates can be used to incorporate neural networks into other probabilistic models. For example, one can improve performance by combining network outputs with a language model in speech recognition (Hannun et al., 2014; Xiong et al., 2016), or with camera information for object detection (Kendall & Cipolla, 2016).

In 2005, Niculescu-Mizil & Caruana (2005) showed that neural networks typically produce well-calibrated probabilities on binary classification tasks. While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated. This is visualized in Figure 1, which compares a 5-layer LeNet (left) (LeCun et al., 1998) with a 110-layer ResNet (right) (He et al., 2016) on the CIFAR-100 dataset. The top row shows the distribution of prediction confidence (i.e. probabilities associated with the predicted label) as histograms. The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. This is further illustrated in the bottom row reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005), which show accuracy as a function of confidence. We see that LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet’s accuracy is better, but does not match its confidence.

Our goal is not only to understand why neural networks have become miscalibrated, but also to identify what methods can alleviate this problem. In this paper, we demonstrate on several computer vision and NLP tasks that neural networks produce confidences that cannot represent true probabilities. Additionally, we offer insight and intuition into network training and architectural trends that may cause miscalibration. Finally, we compare various post-processing calibration methods on state-of-the-art neural networks, and introduce several extensions of our own. Surprisingly, we find that a single-parameter variant of Platt scaling (Platt et al., 1999) – which we refer to as temperature scaling – is often the most effective method at obtaining calibrated probabilities. Because this method is straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings.

2. Definitions

The problem we address in this paper is supervised multi-class classification with neural networks. The input X ∈ 𝒳 and label Y ∈ 𝒴 = {1, . . . , K} are random variables that follow a ground truth joint distribution π(X, Y) = π(Y|X)π(X). Let h be a neural network with h(X) = (Ŷ, P̂), where Ŷ is a class prediction and P̂ is its associated confidence, i.e. probability of correctness. We would like the confidence estimate P̂ to be calibrated, which intuitively means that P̂ represents a true probability. For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. More formally, we define perfect calibration as

$$\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p, \quad \forall p \in [0, 1] \tag{1}$$

where the probability is over the joint distribution. In all practical settings, achieving perfect calibration is impossible. Additionally, the probability in (1) cannot be computed using finitely many samples since P̂ is a continuous random variable. This motivates the need for empirical approximations that capture the essence of (1).

Reliability Diagrams (e.g. Figure 1 bottom) are a visual representation of model calibration (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). These diagrams plot expected sample accuracy as a function of confidence. If the model is perfectly calibrated – i.e. if (1) holds – then the diagram should plot the identity function. Any deviation from a perfect diagonal represents miscalibration.

To estimate the expected accuracy from finite samples, we group predictions into M interval bins (each of size 1/M) and calculate the accuracy of each bin. Let B_m be the set of indices of samples whose prediction confidence falls into the interval I_m = ((m−1)/M, m/M]. The accuracy of B_m is

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),$$

where ŷ_i and y_i are the predicted and true class labels for sample i. Basic probability tells us that acc(B_m) is an unbiased and consistent estimator of P(Ŷ = Y | P̂ ∈ I_m). We define the average confidence within bin B_m as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where p̂_i is the confidence for sample i. acc(B_m) and conf(B_m) approximate the left-hand and right-hand sides of (1) respectively for bin B_m. Therefore, a perfectly calibrated model will have acc(B_m) = conf(B_m) for all m ∈ {1, . . . , M}. Note that reliability diagrams do not display the proportion of samples in a given bin, and thus cannot be used to estimate how many samples are calibrated.

Expected Calibration Error (ECE). While reliability diagrams are useful visual tools, it is more convenient to have a scalar summary statistic of calibration. Since statistics comparing two distributions cannot be comprehensive, previous works have proposed variants, each with a unique emphasis. One notion of miscalibration is the difference in expectation between confidence and accuracy, i.e.

$$\mathbb{E}_{\hat{P}}\left[\, \left| \mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) - p \right| \,\right] \tag{2}$$

Expected Calibration Error (Naeini et al., 2015) – or ECE – approximates (2) by partitioning predictions into M equally-spaced bins (similar to the reliability diagrams) and
taking a weighted average of the bins’ accuracy/confidence difference. More precisely,

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \tag{3}$$

where n is the number of samples. The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams – e.g. Figure 1). We use ECE as the primary empirical metric to measure calibration. See Section S1 for more analysis of this metric.

Maximum Calibration Error (MCE). In high-risk applications where reliable confidence measures are absolutely necessary, we may wish to minimize the worst-case deviation between confidence and accuracy:

$$\max_{p \in [0,1]} \left| \mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) - p \right|. \tag{4}$$

The Maximum Calibration Error (Naeini et al., 2015) – or MCE – estimates an upper bound of this deviation. Similarly to ECE, this approximation involves binning:

$$\mathrm{MCE} = \max_{m \in \{1,\ldots,M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|. \tag{5}$$

In reliability diagrams, MCE measures the largest calibration gap (red bars) across all bins, whereas ECE measures a weighted average of all gaps. For perfectly calibrated classifiers, MCE and ECE both equal 0.

Negative log likelihood is a standard measure of a probabilistic model’s quality (Friedman et al., 2001). It is also referred to as the cross entropy loss in the context of deep learning (Bengio et al., 2015). Given a probabilistic model π̂(Y|X) and n samples, NLL is defined as:

$$\mathcal{L} = -\sum_{i=1}^{n} \log\left(\hat{\pi}(y_i \mid x_i)\right) \tag{6}$$

It is a standard result (Friedman et al., 2001) that, in expectation, NLL is minimized if and only if π̂(Y|X) recovers the ground truth conditional distribution π(Y|X).

[Figure 2 image omitted. Four panels, each plotting Error and ECE (y-axis: Error/ECE) on CIFAR-100: Varying Depth (ResNet; x-axis: Depth), Varying Width (ResNet-14; x-axis: Filters per layer), Using Normalization (ConvNet; Without vs. With Batch Normalization), Varying Weight Decay (ResNet-110; x-axis: Weight decay, 10^−5 to 10^−2).]
Figure 2. The effect of network depth (far left), width (middle left), Batch Normalization (middle right), and weight decay (far right) on miscalibration, as measured by ECE (lower is better).

3. Observing Miscalibration

The architecture and training procedures of neural networks have rapidly evolved in recent years. In this section we identify some recent changes that are responsible for the miscalibration phenomenon observed in Figure 1. Though we cannot claim causality, we find that model capacity and lack of regularization are closely related to model (mis)calibration.

Model capacity. The model capacity of neural networks has increased at a dramatic pace over the past few years. It is now common to see networks with hundreds, if not thousands of layers (He et al., 2016; Huang et al., 2016) and hundreds of convolutional filters per layer (Zagoruyko & Komodakis, 2016). Recent work shows that very deep or wide models are able to generalize better than smaller ones, while exhibiting the capacity to easily fit the training set (Zhang et al., 2017).

Although increasing depth and width may reduce classification error, we observe that these increases negatively affect model calibration. Figure 2 displays error and ECE as a function of depth and width on a ResNet trained on CIFAR-100. The far left figure varies depth for a network with 64 convolutional filters per layer, while the middle left figure fixes the depth at 14 layers and varies the number of convolutional filters per layer. Though even the smallest models in the graph exhibit some degree of miscalibration, the ECE metric grows substantially with model capacity. During training, after the model is able to correctly classify (almost) all training samples, NLL can be further minimized by increasing the confidence of predictions. Increased model capacity will lower training NLL, and thus the model will be more (over)confident on average.

Batch Normalization (Ioffe & Szegedy, 2015) improves the optimization of neural networks by minimizing distribution shifts in activations within the neural network’s hidden layers. Recent research suggests that these normalization techniques have enabled the development of very deep architectures, such as ResNets (He et al., 2016) and DenseNets (Huang et al., 2017). It has been shown that Batch Normalization improves training time, reduces the need for additional regularization, and can in some cases improve the accuracy of networks.

While it is difficult to pinpoint exactly how Batch Normalization affects the final predictions of a model, we do observe that models trained with Batch Normalization tend to be more miscalibrated. In the middle right plot of Figure 2, we see that a 6-layer ConvNet obtains worse calibration when Batch Normalization is applied, even though classification accuracy improves slightly. We find that this result holds regardless of the hyperparameters used on the Batch Normalization model (i.e. low or high learning rate, etc.).

Weight decay, which used to be the predominant regularization mechanism for neural networks, is decreasingly utilized when training modern neural networks. Learning theory suggests that regularization is necessary to prevent overfitting, especially as model capacity increases (Vapnik, 1998). However, due to the apparent regularization effects of Batch Normalization, recent research seems to suggest that models with less L2 regularization tend to generalize better (Ioffe & Szegedy, 2015). As a result, it is now common to train models with little weight decay, if any at all. The top performing ImageNet models of 2015 all use an order of magnitude less weight decay than models of previous years (He et al., 2016; Simonyan & Zisserman, 2015).

We find that training with less weight decay has a negative impact on calibration. The far right plot in Figure 2 displays training error and ECE for a 110-layer ResNet with varying amounts of weight decay. The only other forms of regularization are data augmentation and Batch Normalization. We observe that calibration and accuracy are not optimized by the same parameter setting. While the model exhibits both over-regularization and under-regularization with respect to classification error, it does not appear that calibration is negatively impacted by having too much weight decay. Model calibration continues to improve when more regularization is added, well after the point of achieving optimal accuracy. The slight uptick at the end of the graph may be an artifact of using a weight decay factor that impedes optimization.

[Figure 3 image omitted. Title: NLL Overfitting on CIFAR-100. x-axis: Epoch (0 to 500); y-axis: Error (%) / NLL (scaled); curves: Test error and Test NLL.]
Figure 3. Test error and NLL of a 110-layer ResNet with stochastic depth on CIFAR-100 during training. NLL is scaled by a constant to fit in the figure. Learning rate drops by 10x at epochs 250 and 375. The shaded area marks between epochs at which the best validation loss and best validation error are produced.

NLL can be used to indirectly measure model calibration. In practice, we observe a disconnect between NLL and accuracy, which may explain the miscalibration in Figure 2. This disconnect occurs because neural networks can overfit to NLL without overfitting to the 0/1 loss. We observe this trend in the training curves of some miscalibrated models. Figure 3 shows test error and NLL (rescaled to match error) on CIFAR-100 as training progresses. Both error and NLL immediately drop at epoch 250, when the learning rate is dropped; however, NLL overfits during the remainder of training. Surprisingly, overfitting to NLL is beneficial to classification accuracy. On CIFAR-100, test error drops from 29% to 27% in the region where NLL overfits. This phenomenon renders a concrete explanation of miscalibration: the network learns better classification accuracy at the expense of well-modeled probabilities.

We can connect this finding to recent work examining the generalization of large neural networks. Zhang et al. (2017) observe that deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.

4. Calibration Methods

In this section, we first review existing calibration methods, and introduce new variants of our own. All methods are post-processing steps that produce (calibrated) probabilities. Each method requires a hold-out validation set, which in practice can be the same set used for hyperparameter tuning. We assume that the training, validation, and test sets are drawn from the same distribution.

4.1. Calibrating Binary Models

We first introduce calibration in the binary setting, i.e. 𝒴 = {0, 1}. For simplicity, throughout this subsection,
we assume the model outputs only the confidence for the positive class.¹ Given a sample x_i, we have access to p̂_i – the network’s predicted probability of y_i = 1, as well as z_i ∈ ℝ – which is the network’s non-probabilistic output, or logit. The predicted probability p̂_i is derived from z_i using a sigmoid function σ; i.e. p̂_i = σ(z_i). Our goal is to produce a calibrated probability q̂_i based on y_i, p̂_i, and z_i.

Histogram binning (Zadrozny & Elkan, 2001) is a simple non-parametric calibration method. In a nutshell, all uncalibrated predictions p̂_i are divided into mutually exclusive bins B_1, . . . , B_M. Each bin is assigned a calibrated score θ_m; i.e. if p̂_i is assigned to bin B_m, then q̂_i = θ_m. At test time, if prediction p̂_te falls into bin B_m, then the calibrated prediction q̂_te is θ_m. More precisely, for a suitably chosen M (usually small), we first define bin boundaries 0 = a_1 ≤ a_2 ≤ . . . ≤ a_{M+1} = 1, where the bin B_m is defined by the interval (a_m, a_{m+1}]. Typically the bin boundaries are either chosen to be equal length intervals or to equalize the number of samples in each bin. The predictions θ_m are chosen to minimize the bin-wise squared loss:

$$\min_{\theta_1,\ldots,\theta_M} \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \le \hat{p}_i < a_{m+1}) \left(\theta_m - y_i\right)^2, \tag{7}$$

where 1 is the indicator function. Given fixed bin boundaries, the solution to (7) results in θ_m that correspond to the fraction of positive-class samples in bin B_m.

Isotonic regression (Zadrozny & Elkan, 2002), arguably the most common non-parametric calibration method, learns a piecewise constant function f to transform uncalibrated outputs; i.e. q̂_i = f(p̂_i). Specifically, isotonic regression produces f to minimize the square loss Σ_{i=1}^{n} (f(p̂_i) − y_i)². Because f is constrained to be piecewise constant, we can write the optimization problem as:

$$\min_{\substack{M \\ \theta_1,\ldots,\theta_M \\ a_1,\ldots,a_{M+1}}} \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \le \hat{p}_i < a_{m+1}) \left(\theta_m - y_i\right)^2$$
$$\text{subject to } 0 = a_1 \le a_2 \le \ldots \le a_{M+1} = 1, \quad \theta_1 \le \theta_2 \le \ldots \le \theta_M,$$

where M is the number of intervals; a_1, . . . , a_{M+1} are the interval boundaries; and θ_1, . . . , θ_M are the function values. Under this parameterization, isotonic regression is a strict generalization of histogram binning in which the bin boundaries and bin predictions are jointly optimized.

Bayesian Binning into Quantiles (BBQ) (Naeini et al., 2015) is an extension of histogram binning using Bayesian model averaging. Essentially, BBQ marginalizes out all possible binning schemes to produce q̂_i. More formally, a binning scheme s is a pair (M, I) where M is the number of bins, and I is a corresponding partitioning of [0, 1] into disjoint intervals (0 = a_1 ≤ a_2 ≤ . . . ≤ a_{M+1} = 1). The parameters of a binning scheme are θ_1, . . . , θ_M. Under this framework, histogram binning and isotonic regression both produce a single binning scheme, whereas BBQ considers a space S of all possible binning schemes for the validation dataset D. BBQ performs Bayesian averaging of the probabilities produced by each scheme:²

$$\mathbb{P}(\hat{q}_{te} \mid \hat{p}_{te}, D) = \sum_{s \in \mathcal{S}} \mathbb{P}(\hat{q}_{te}, S = s \mid \hat{p}_{te}, D) = \sum_{s \in \mathcal{S}} \mathbb{P}(\hat{q}_{te} \mid \hat{p}_{te}, S = s, D)\, \mathbb{P}(S = s \mid D),$$

where P(q̂_te | p̂_te, S = s, D) is the calibrated probability using binning scheme s. Using a uniform prior, the weight P(S = s | D) can be derived using Bayes’ rule:

$$\mathbb{P}(S = s \mid D) = \frac{\mathbb{P}(D \mid S = s)}{\sum_{s' \in \mathcal{S}} \mathbb{P}(D \mid S = s')}.$$

The parameters θ_1, . . . , θ_M can be viewed as parameters of M independent binomial distributions. Hence, by placing a Beta prior on θ_1, . . . , θ_M, we can obtain a closed form expression for the marginal likelihood P(D | S = s). This allows us to compute P(q̂_te | p̂_te, D) for any test input.

Platt scaling (Platt et al., 1999) is a parametric approach to calibration, unlike the other approaches. The non-probabilistic predictions of a classifier are used as features for a logistic regression model, which is trained on the validation set to return probabilities. More specifically, in the context of neural networks (Niculescu-Mizil & Caruana, 2005), Platt scaling learns scalar parameters a, b ∈ ℝ and outputs q̂_i = σ(a z_i + b) as the calibrated probability. Parameters a and b can be optimized using the NLL loss over the validation set. It is important to note that the neural network’s parameters are fixed during this stage.

4.2. Extension to Multiclass Models

For classification problems involving K > 2 classes, we return to the original problem formulation. The network outputs a class prediction ŷ_i and confidence score p̂_i for each input x_i. In this case, the network logits z_i are vectors, where ŷ_i = argmax_k z_i^(k), and p̂_i is typically derived using the softmax function σ_SM:

$$\sigma_{SM}(z_i)^{(k)} = \frac{\exp\left(z_i^{(k)}\right)}{\sum_{j=1}^{K} \exp\left(z_i^{(j)}\right)}, \qquad \hat{p}_i = \max_k \sigma_{SM}(z_i)^{(k)}.$$

The goal is to produce a calibrated confidence q̂_i and (possibly new) class prediction ŷ_i′ based on y_i, ŷ_i, p̂_i, and z_i.

¹ This is in contrast with the setting in Section 2, in which the model produces both a class prediction and confidence.
² Because the validation dataset is finite, S is as well.
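Two of the binary calibration methods from Section 4.1 can be sketched in a few lines. The block below is an illustration under stated assumptions, not the authors' code: histogram binning uses equal-width bins and falls back to the bin midpoint for empty bins (the text does not say what to do there), and Platt scaling is fit by plain gradient descent on the validation NLL (the text only says the parameters are optimized with the NLL loss). All function names, the learning rate, and the step count are hypothetical.

```python
import math


def bin_index(p, num_bins):
    """Zero-based index of the equal-width bin (a_m, a_{m+1}] containing p."""
    return min(max(math.ceil(p * num_bins) - 1, 0), num_bins - 1)


def fit_histogram_binning(probs, labels, num_bins=10):
    """Return theta_1..theta_M: the fraction of positive labels in each bin."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for p, y in zip(probs, labels):
        m = bin_index(p, num_bins)
        sums[m] += y
        counts[m] += 1
    # Assumption: an empty bin falls back to its midpoint.
    return [sums[m] / counts[m] if counts[m] else (m + 0.5) / num_bins
            for m in range(num_bins)]


def apply_histogram_binning(thetas, p):
    """Calibrated probability for an uncalibrated prediction p."""
    return thetas[bin_index(p, len(thetas))]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_platt_scaling(logits, labels, lr=0.1, steps=5000):
    """Fit q = sigmoid(a * z + b) by gradient descent on the mean NLL.

    The network that produced the logits is not touched; only a and b change.
    """
    a, b = 1.0, 0.0
    n = len(logits)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y  # derivative of NLL w.r.t. (a*z + b)
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

Both functions would be fit on the held-out validation set and then applied unchanged to test predictions. Histogram binning has M free values and can represent any step-shaped correction, while Platt scaling has only two parameters and is restricted to a sigmoid of the logit.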
Dataset | Model | Uncalibrated | Hist. Binning | Isotonic | BBQ | Temp. Scaling | Vector Scaling | Matrix Scaling
--- | --- | --- | --- | --- | --- | --- | --- | ---
Birds | ResNet 50 | 9.19% | 4.34% | 5.22% | 4.12% | 1.85% | 3.0% | 21.13%
Cars | ResNet 50 | 4.3% | 1.74% | 4.29% | 1.84% | 2.35% | 2.37% | 10.5%
CIFAR-10 | ResNet 110 | 4.6% | 0.58% | 0.81% | 0.54% | 0.83% | 0.88% | 1.0%
CIFAR-10 | ResNet 110 (SD) | 4.12% | 0.67% | 1.11% | 0.9% | 0.6% | 0.64% | 0.72%
CIFAR-10 | Wide ResNet 32 | 4.52% | 0.72% | 1.08% | 0.74% | 0.54% | 0.6% | 0.72%
CIFAR-10 | DenseNet 40 | 3.28% | 0.44% | 0.61% | 0.81% | 0.33% | 0.41% | 0.41%
CIFAR-10 | LeNet 5 | 3.02% | 1.56% | 1.85% | 1.59% | 0.93% | 1.15% | 1.16%
CIFAR-100 | ResNet 110 | 16.53% | 2.66% | 4.99% | 5.46% | 1.26% | 1.32% | 25.49%
CIFAR-100 | ResNet 110 (SD) | 12.67% | 2.46% | 4.16% | 3.58% | 0.96% | 0.9% | 20.09%
CIFAR-100 | Wide ResNet 32 | 15.0% | 3.01% | 5.85% | 5.77% | 2.32% | 2.57% | 24.44%
CIFAR-100 | DenseNet 40 | 10.37% | 2.68% | 4.51% | 3.59% | 1.18% | 1.09% | 21.87%
CIFAR-100 | LeNet 5 | 4.85% | 6.48% | 2.35% | 3.77% | 2.02% | 2.09% | 13.24%
ImageNet | DenseNet 161 | 6.28% | 4.52% | 5.18% | 3.51% | 1.99% | 2.24% | -
ImageNet | ResNet 152 | 5.48% | 4.36% | 4.77% | 3.56% | 1.86% | 2.23% | -
SVHN | ResNet 152 (SD) | 0.44% | 0.14% | 0.28% | 0.22% | 0.17% | 0.27% | 0.17%
20 News | DAN 3 | 8.02% | 3.6% | 5.52% | 4.98% | 4.11% | 4.61% | 9.1%
Reuters | DAN 3 | 0.85% | 1.75% | 1.15% | 0.97% | 0.91% | 0.66% | 1.58%
SST Binary | TreeLSTM | 6.63% | 1.93% | 1.65% | 2.27% | 1.84% | 1.84% | 1.84%
SST Fine Grained | TreeLSTM | 6.71% | 2.09% | 1.65% | 2.61% | 2.56% | 2.98% | 2.39%
Table 1. ECE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods.
The number following a model’s name denotes the network depth.
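Table 1 reports ECE with M = 15 bins. As a rough sketch of how such a number could be computed from per-sample confidences and correctness flags, the block below follows the binned definitions of ECE (3) and MCE (5) from Section 2. The function name and argument layout are assumptions made for illustration, and samples lying exactly on a bin boundary are subject to floating-point rounding.

```python
import math


def calibration_errors(confidences, correct, num_bins=15):
    """Return (ECE, MCE) using equal-width bins I_m = ((m-1)/M, m/M].

    confidences: predicted-class probabilities, one per sample.
    correct: truthy where the predicted class equals the true class.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    counts = [0] * num_bins        # |B_m|
    hits = [0.0] * num_bins        # number of correct predictions in B_m
    conf_sums = [0.0] * num_bins   # summed confidence in B_m
    for p, c in zip(confidences, correct):
        # ceil(p * M) - 1 maps p in ((m-1)/M, m/M] to zero-based bin m - 1;
        # p == 0 is clamped into the first bin.
        m = min(max(math.ceil(p * num_bins) - 1, 0), num_bins - 1)
        counts[m] += 1
        hits[m] += 1.0 if c else 0.0
        conf_sums[m] += p
    ece = 0.0
    mce = 0.0
    for count, hit, conf_sum in zip(counts, hits, conf_sums):
        if count == 0:
            continue  # empty bins contribute nothing to either metric
        gap = abs(hit / count - conf_sum / count)  # |acc(B_m) - conf(B_m)|
        ece += (count / n) * gap
        mce = max(mce, gap)
    return ece, mce
```

As a small worked case, four predictions at confidence 0.95 with three correct, plus two at 0.55 with one correct, give gaps of 0.20 and 0.05, so ECE = (4/6)(0.20) + (2/6)(0.05) = 0.15 and MCE = 0.20.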
Extension of binning methods. One common way of extending binary calibration methods to the multiclass setting is by treating the problem as K one-versus-all problems (Zadrozny & Elkan, 2002). For k = 1, . . . , K, we form a binary calibration problem where the label is 1(y_i = k) and the predicted probability is σ_SM(z_i)^(k). This gives us K calibration models, each for a particular class. At test time, we obtain an unnormalized probability vector [q̂_i^(1), . . . , q̂_i^(K)], where q̂_i^(k) is the calibrated probability for class k. The new class prediction ŷ_i′ is the argmax of the vector, and the new confidence q̂_i′ is the max of the vector normalized by Σ_{k=1}^{K} q̂_i^(k). This extension can be applied

T is called the temperature, and it “softens” the softmax (i.e. raises the output entropy) with T > 1. As T → ∞, the probability q̂_i approaches 1/K, which represents maximum uncertainty. With T = 1, we recover the original probability p̂_i. As T → 0, the probability collapses to a point mass (i.e. q̂_i = 1). T is optimized with respect to NLL on the validation set. Because the parameter T does not change the maximum of the softmax function, the class prediction ŷ_i′ remains unchanged. In other words, temperature scaling does not affect the model’s accuracy.

Temperature scaling is commonly used in settings such as