On Calibration of Modern Neural Networks


Abstract


Modern neural networks often output confidence scores that are too high relative to their true correctness rate. The paper shows this systematically, then demonstrates that a one-parameter post-hoc correction, temperature scaling, is often enough to repair calibration without retraining the model.

Key Takeaways

  1. Modern deep networks can achieve high accuracy while still being overconfident, so calibration and accuracy are separate properties.
  2. Common modern training choices often improve accuracy while making calibration worse.
  3. Simple temperature scaling provides a strong post-hoc fix for many real classification systems.
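The third takeaway can be made concrete with a short sketch. Below is a minimal NumPy illustration of temperature scaling, not code from the paper: logits are divided by a single scalar T, and T is chosen to minimize negative log-likelihood on held-out data. A grid search stands in for the gradient-based optimization one would normally use, and the toy data is an invented example of an overconfident classifier.

```python
import numpy as np

def softmax(z, T=1.0):
    """Row-wise softmax of logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(z, y, T):
    """Average negative log-likelihood of labels y under softmax(z / T)."""
    p = softmax(z, T)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def fit_temperature(z_val, y_val, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T that minimizes validation NLL (simple grid search)."""
    return min(grid, key=lambda T: nll(z_val, y_val, T))

# Toy example: logits carry real signal but are scaled up,
# which exaggerates confidence without changing the arg max.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)
z = rng.normal(size=(500, 3))
z[np.arange(500), y] += 1.0   # add signal toward the true class
z *= 4.0                      # inflate confidence

T = fit_temperature(z, y)
print(T > 1.0)  # True for these deliberately overconfident logits
```

Because dividing logits by T does not change the arg max, accuracy is untouched; only the confidence distribution is softened (T > 1) or sharpened (T < 1).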

Full Text

On Calibration of Modern Neural Networks

Chuan Guo*¹, Geoff Pleiss*¹, Yu Sun*¹, Kilian Q. Weinberger¹


Abstract

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling – a single-parameter variant of Platt Scaling – is surprisingly effective at calibrating predictions.

Figure 1. Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100. Refer to the text below for detailed illustration. [Plot details: top row shows % of Samples vs. Confidence with Accuracy and Avg. confidence marked; bottom row shows Accuracy vs. Confidence with Outputs and Gap bars; Error = 44.9 (LeNet), Error = 30.6 (ResNet).]

1. Introduction
Recent advances in deep learning have dramatically improved neural network accuracy (Simonyan & Zisserman, 2015; Srivastava et al., 2015; He et al., 2016; Huang et al., 2016; 2017). As a result, neural networks are now entrusted with making complex decisions in applications, such as object detection (Girshick, 2015), speech recognition (Hannun et al., 2014), and medical diagnosis (Caruana et al., 2015). In these settings, neural networks are an essential component of larger decision making pipelines.

In real-world decision making systems, classification networks must not only be accurate, but also should indicate when they are likely to be incorrect. As an example, consider a self-driving car that uses a neural network to detect pedestrians and other obstructions (Bojarski et al., 2016). If the detection network is not able to confidently predict the presence or absence of immediate obstructions, the car should rely more on the output of other sensors for braking. Alternatively, in automated health care, control should be passed on to human doctors when the confidence of a disease diagnosis network is low (Jiang et al., 2012). Specifically, a network should provide a calibrated confidence measure in addition to its prediction. In other words, the probability associated with the predicted class label should reflect its ground truth correctness likelihood.

Calibrated confidence estimates are also important for model interpretability. Humans have a natural cognitive intuition for probabilities (Cosmides & Tooby, 1996). Good confidence estimates provide a valuable extra bit of information to establish trustworthiness with the user – especially for neural networks, whose classification decisions are often difficult to interpret. Further, good probability estimates can be used to incorporate neural networks into other probabilistic models. For example, one can improve performance by combining network outputs with a language model in speech recognition (Hannun et al., 2014; Xiong et al., 2016), or with camera information for object detection (Kendall & Cipolla, 2016).

In 2005, Niculescu-Mizil & Caruana (2005) showed that neural networks typically produce well-calibrated probabilities on binary classification tasks. While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated. This is visualized in Figure 1, which compares a 5-layer LeNet (left) (LeCun et al., 1998) with a 110-layer ResNet (right) (He et al., 2016) on the CIFAR-100 dataset. The top row shows the distribution of prediction confidence (i.e. probabilities associated with the predicted label) as histograms. The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. This is further illustrated in the bottom row reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005), which show accuracy as a function of confidence. We see that LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet's accuracy is better, but does not match its confidence.

Our goal is not only to understand why neural networks have become miscalibrated, but also to identify what methods can alleviate this problem. In this paper, we demonstrate on several computer vision and NLP tasks that neural networks produce confidences that cannot represent true probabilities. Additionally, we offer insight and intuition into network training and architectural trends that may cause miscalibration. Finally, we compare various post-processing calibration methods on state-of-the-art neural networks, and introduce several extensions of our own. Surprisingly, we find that a single-parameter variant of Platt scaling (Platt et al., 1999) – which we refer to as temperature scaling – is often the most effective method at obtaining calibrated probabilities. Because this method is straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings.

*Equal contribution, alphabetical order. ¹Cornell University. Correspondence to: Chuan Guo <cg563@cornell.edu>, Geoff Pleiss <geoff@cs.cornell.edu>, Yu Sun <ys646@cornell.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

2. Definitions

The problem we address in this paper is supervised multi-class classification with neural networks. The input X ∈ 𝒳 and label Y ∈ 𝒴 = {1, . . . , K} are random variables that follow a ground truth joint distribution π(X, Y) = π(Y|X)π(X). Let h be a neural network with h(X) = (Ŷ, P̂), where Ŷ is a class prediction and P̂ is its associated confidence, i.e. probability of correctness. We would like the confidence estimate P̂ to be calibrated, which intuitively means that P̂ represents a true probability. For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. More formally, we define perfect calibration as

$$P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1] \qquad (1)$$

where the probability is over the joint distribution. In all practical settings, achieving perfect calibration is impossible. Additionally, the probability in (1) cannot be computed using finitely many samples since P̂ is a continuous random variable. This motivates the need for empirical approximations that capture the essence of (1).

Reliability Diagrams (e.g. Figure 1 bottom) are a visual representation of model calibration (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). These diagrams plot expected sample accuracy as a function of confidence. If the model is perfectly calibrated – i.e. if (1) holds – then the diagram should plot the identity function. Any deviation from a perfect diagonal represents miscalibration.

To estimate the expected accuracy from finite samples, we group predictions into M interval bins (each of size 1/M) and calculate the accuracy of each bin. Let Bm be the set of indices of samples whose prediction confidence falls into the interval Im = ((m − 1)/M, m/M]. The accuracy of Bm is

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),$$

where ŷi and yi are the predicted and true class labels for sample i. Basic probability tells us that acc(Bm) is an unbiased and consistent estimator of P(Ŷ = Y | P̂ ∈ Im). We define the average confidence within bin Bm as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where p̂i is the confidence for sample i. acc(Bm) and conf(Bm) approximate the left-hand and right-hand sides of (1) respectively for bin Bm. Therefore, a perfectly calibrated model will have acc(Bm) = conf(Bm) for all m ∈ {1, . . . , M}. Note that reliability diagrams do not display the proportion of samples in a given bin, and thus cannot be used to estimate how many samples are calibrated.

Expected Calibration Error (ECE). While reliability diagrams are useful visual tools, it is more convenient to have a scalar summary statistic of calibration. Since statistics comparing two distributions cannot be comprehensive, previous works have proposed variants, each with a unique emphasis. One notion of miscalibration is the difference in expectation between confidence and accuracy, i.e.

$$\mathbb{E}_{\hat{P}}\Big[\,\big|\,P(\hat{Y} = Y \mid \hat{P} = p) - p\,\big|\,\Big] \qquad (2)$$

Expected Calibration Error (Naeini et al., 2015) – or ECE – approximates (2) by partitioning predictions into M equally-spaced bins (similar to the reliability diagrams) and
Figure 2. The effect of network depth (far left), width (middle left), Batch Normalization (middle right), and weight decay (far right) on miscalibration, as measured by ECE (lower is better). [Plot details: the panels, all on CIFAR-100, are Varying Depth (ResNet), Varying Width (ResNet-14), Using Normalization (ConvNet), and Varying Weight Decay (ResNet-110); each plots Error and ECE on a 0.0–0.7 scale against depth, filters per layer, Batch Normalization (without/with), and weight decay (10⁻⁵–10⁻²) respectively.]

taking a weighted average of the bins' accuracy/confidence difference. More precisely,

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|, \qquad (3)$$

where n is the number of samples. The difference between acc and conf for a given bin represents the calibration gap (red bars in reliability diagrams – e.g. Figure 1). We use ECE as the primary empirical metric to measure calibration. See Section S1 for more analysis of this metric.

Maximum Calibration Error (MCE). In high-risk applications where reliable confidence measures are absolutely necessary, we may wish to minimize the worst-case deviation between confidence and accuracy:

$$\max_{p \in [0,1]} \big|\,P(\hat{Y} = Y \mid \hat{P} = p) - p\,\big|. \qquad (4)$$

The Maximum Calibration Error (Naeini et al., 2015) – or MCE – estimates an upper bound of this deviation. Similarly to ECE, this approximation involves binning:

$$\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\big|. \qquad (5)$$

In reliability diagrams, MCE measures the largest calibration gap (red bars) across all bins, whereas ECE measures a weighted average of all gaps. For perfectly calibrated classifiers, MCE and ECE both equal 0.

Negative log likelihood is a standard measure of a probabilistic model's quality (Friedman et al., 2001). It is also referred to as the cross entropy loss in the context of deep learning (Bengio et al., 2015). Given a probabilistic model π̂(Y|X) and n samples, NLL is defined as:

$$\mathcal{L} = -\sum_{i=1}^{n} \log(\hat{\pi}(y_i \mid x_i)) \qquad (6)$$

It is a standard result (Friedman et al., 2001) that, in expectation, NLL is minimized if and only if π̂(Y|X) recovers the ground truth conditional distribution π(Y|X).

3. Observing Miscalibration

The architecture and training procedures of neural networks have rapidly evolved in recent years. In this section we identify some recent changes that are responsible for the miscalibration phenomenon observed in Figure 1. Though we cannot claim causality, we find that model capacity and lack of regularization are closely related to model (mis)calibration.

Model capacity. The model capacity of neural networks has increased at a dramatic pace over the past few years. It is now common to see networks with hundreds, if not thousands of layers (He et al., 2016; Huang et al., 2016) and hundreds of convolutional filters per layer (Zagoruyko & Komodakis, 2016). Recent work shows that very deep or wide models are able to generalize better than smaller ones, while exhibiting the capacity to easily fit the training set (Zhang et al., 2017).

Although increasing depth and width may reduce classification error, we observe that these increases negatively affect model calibration. Figure 2 displays error and ECE as a function of depth and width on a ResNet trained on CIFAR-100. The far left figure varies depth for a network with 64 convolutional filters per layer, while the middle left figure fixes the depth at 14 layers and varies the number of convolutional filters per layer. Though even the smallest models in the graph exhibit some degree of miscalibration, the ECE metric grows substantially with model capacity. During training, after the model is able to correctly classify (almost) all training samples, NLL can be further minimized by increasing the confidence of predictions. Increased model capacity will lower training NLL, and thus the model will be more (over)confident on average.

Batch Normalization (Ioffe & Szegedy, 2015) improves the optimization of neural networks by minimizing distribution shifts in activations within the neural network's hidden layers. Recent research suggests that these normalization techniques have enabled the development of very deep architectures, such as ResNets (He et al., 2016) and DenseNets (Huang et al., 2017). It has been shown that Batch Normalization improves training time, reduces the need for additional regularization, and can in some cases improve the accuracy of networks.

While it is difficult to pinpoint exactly how Batch Normalization affects the final predictions of a model, we do observe that models trained with Batch Normalization tend to be more miscalibrated. In the middle right plot of Figure 2, we see that a 6-layer ConvNet obtains worse calibration when Batch Normalization is applied, even though classification accuracy improves slightly. We find that this result holds regardless of the hyperparameters used on the Batch Normalization model (i.e. low or high learning rate, etc.).

Weight decay, which used to be the predominant regularization mechanism for neural networks, is decreasingly utilized when training modern neural networks. Learning theory suggests that regularization is necessary to prevent overfitting, especially as model capacity increases (Vapnik, 1998). However, due to the apparent regularization effects of Batch Normalization, recent research seems to suggest that models with less L2 regularization tend to generalize better (Ioffe & Szegedy, 2015). As a result, it is now common to train models with little weight decay, if any at all. The top performing ImageNet models of 2015 all use an order of magnitude less weight decay than models of previous years (He et al., 2016; Simonyan & Zisserman, 2015).

We find that training with less weight decay has a negative impact on calibration. The far right plot in Figure 2 displays training error and ECE for a 110-layer ResNet with varying amounts of weight decay. The only other forms of regularization are data augmentation and Batch Normalization. We observe that calibration and accuracy are not optimized by the same parameter setting. While the model exhibits both over-regularization and under-regularization with respect to classification error, it does not appear that calibration is negatively impacted by having too much weight decay. Model calibration continues to improve when more regularization is added, well after the point of achieving optimal accuracy. The slight uptick at the end of the graph may be an artifact of using a weight decay factor that impedes optimization.

Figure 3. Test error and NLL of a 110-layer ResNet with stochastic depth on CIFAR-100 during training. NLL is scaled by a constant to fit in the figure. Learning rate drops by 10x at epochs 250 and 375. The shaded area marks between epochs at which the best validation loss and best validation error are produced. [Plot details: Error (%) / NLL (scaled) vs. Epoch, with Test error and Test NLL curves.]

NLL can be used to indirectly measure model calibration. In practice, we observe a disconnect between NLL and accuracy, which may explain the miscalibration in Figure 2. This disconnect occurs because neural networks can overfit to NLL without overfitting to the 0/1 loss. We observe this trend in the training curves of some miscalibrated models. Figure 3 shows test error and NLL (rescaled to match error) on CIFAR-100 as training progresses. Both error and NLL immediately drop at epoch 250, when the learning rate is dropped; however, NLL overfits during the remainder of training. Surprisingly, overfitting to NLL is beneficial to classification accuracy. On CIFAR-100, test error drops from 29% to 27% in the region where NLL overfits. This phenomenon renders a concrete explanation of miscalibration: the network learns better classification accuracy at the expense of well-modeled probabilities.

We can connect this finding to recent work examining the generalization of large neural networks. Zhang et al. (2017) observe that deep neural networks seemingly violate the common understanding of learning theory that large models with little regularization will not generalize well. The observed disconnect between NLL and 0/1 loss suggests that these high capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.

4. Calibration Methods

In this section, we first review existing calibration methods, and introduce new variants of our own. All methods are post-processing steps that produce (calibrated) probabilities. Each method requires a hold-out validation set, which in practice can be the same set used for hyperparameter tuning. We assume that the training, validation, and test sets are drawn from the same distribution.

4.1. Calibrating Binary Models

We first introduce calibration in the binary setting, i.e. Y = {0, 1}. For simplicity, throughout this subsection,
we assume the model outputs only the confidence for the positive class.¹ Given a sample xi, we have access to p̂i – the network's predicted probability of yi = 1, as well as zi ∈ ℝ – which is the network's non-probabilistic output, or logit. The predicted probability p̂i is derived from zi using a sigmoid function σ; i.e. p̂i = σ(zi). Our goal is to produce a calibrated probability q̂i based on yi, p̂i, and zi.

Histogram binning (Zadrozny & Elkan, 2001) is a simple non-parametric calibration method. In a nutshell, all uncalibrated predictions p̂i are divided into mutually exclusive bins B1, . . . , BM. Each bin is assigned a calibrated score θm; i.e. if p̂i is assigned to bin Bm, then q̂i = θm. At test time, if prediction p̂te falls into bin Bm, then the calibrated prediction q̂te is θm. More precisely, for a suitably chosen M (usually small), we first define bin boundaries 0 = a1 ≤ a2 ≤ . . . ≤ aM+1 = 1, where the bin Bm is defined by the interval (am, am+1]. Typically the bin boundaries are either chosen to be equal length intervals or to equalize the number of samples in each bin. The predictions θm are chosen to minimize the bin-wise squared loss:

$$\min_{\theta_1, \ldots, \theta_M} \; \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \le \hat{p}_i < a_{m+1}) \, (\theta_m - y_i)^2, \qquad (7)$$

where 1 is the indicator function. Given fixed bin boundaries, the solution to (7) results in θm that correspond to the average number of positive-class samples in bin Bm.

Isotonic regression (Zadrozny & Elkan, 2002), arguably the most common non-parametric calibration method, learns a piecewise constant function f to transform uncalibrated outputs; i.e. q̂i = f(p̂i). Specifically, isotonic regression produces f to minimize the square loss $\sum_{i=1}^{n} (f(\hat{p}_i) - y_i)^2$. Because f is constrained to be piecewise constant, we can write the optimization problem as:

$$\min_{M, \; \theta_1, \ldots, \theta_M, \; a_1, \ldots, a_{M+1}} \; \sum_{m=1}^{M} \sum_{i=1}^{n} \mathbf{1}(a_m \le \hat{p}_i < a_{m+1}) \, (\theta_m - y_i)^2 \qquad (8)$$

subject to 0 = a1 ≤ a2 ≤ . . . ≤ aM+1 = 1 and θ1 ≤ θ2 ≤ . . . ≤ θM.

Bayesian Binning into Quantiles (BBQ) (Naeini et al., 2015) extends histogram binning using Bayesian model averaging. Essentially, BBQ marginalizes out all possible binning schemes to produce q̂i. More formally, a binning scheme s is a pair (M, I) where M is the number of bins, and I is a corresponding partitioning of [0, 1] into disjoint intervals (0 = a1 ≤ a2 ≤ . . . ≤ aM+1 = 1). The parameters of a binning scheme are θ1, . . . , θM. Under this framework, histogram binning and isotonic regression both produce a single binning scheme, whereas BBQ considers a space S of all possible binning schemes for the validation dataset D. BBQ performs Bayesian averaging of the probabilities produced by each scheme:²

$$P(\hat{q}_{te} \mid \hat{p}_{te}, D) = \sum_{s \in S} P(\hat{q}_{te}, S = s \mid \hat{p}_{te}, D) = \sum_{s \in S} P(\hat{q}_{te} \mid \hat{p}_{te}, S = s, D) \, P(S = s \mid D),$$

where P(q̂te | p̂te, S = s, D) is the calibrated probability using binning scheme s. Using a uniform prior, the weight P(S = s | D) can be derived using Bayes' rule:

$$P(S = s \mid D) = \frac{P(D \mid S = s)}{\sum_{s' \in S} P(D \mid S = s')}.$$

The parameters θ1, . . . , θM can be viewed as parameters of M independent binomial distributions. Hence, by placing a Beta prior on θ1, . . . , θM, we can obtain a closed form expression for the marginal likelihood P(D | S = s). This allows us to compute P(q̂te | p̂te, D) for any test input.

Platt scaling (Platt et al., 1999) is a parametric approach to calibration, unlike the other approaches. The non-probabilistic predictions of a classifier are used as features for a logistic regression model, which is trained on the validation set to return probabilities. More specifically, in the context of neural networks (Niculescu-Mizil & Caruana, 2005), Platt scaling learns scalar parameters a, b ∈ ℝ and outputs q̂i = σ(azi + b) as the calibrated probability. Parameters a and b can be optimized using the NLL loss over the validation set. It is important to note that the neural network's parameters are fixed during this stage.
       min                  1(am ≤ p̂i < am+1 ) (θm − yi )
         M
    θ1 ,...,θM   m=1 i=1
   a1 ,...,aM +1                                                      4.2. Extension to Multiclass Models
    subject to 0 = a1 ≤ a2 ≤ . . . ≤ aM +1 = 1,                       For classification problems involving K > 2 classes, we
                  θ1 ≤ θ2 ≤ . . . ≤ θM .                              return to the original problem formulation. The network
where M is the number of intervals; a1 , . . . , aM +1 are the        outputs a class prediction ŷi and confidence score p̂i for
interval boundaries; and θ1 , . . . , θM are the function val-        each input xi . In this case, the network logits zi are vectors,
                                                                                                (k)
ues. Under this parameterization, isotonic regression is a            where ŷi = argmaxk zi , and p̂i is typically derived using
strict generalization of histogram binning in which the bin           the softmax function σSM :
boundaries and bin predictions are jointly optimized.                                  exp(zi )
                                                                                                     (k)
                                                                      σSM (zi )(k) = PK        (j)
                                                                                                   ,             p̂i = max σSM (zi )(k) .
                                                                                                                         k
                                                                                      j=1 exp(zi )
Bayesian Binning into Quantiles (BBQ) (Naeini et al.,
2015) is a extension of histogram binning using Bayesian              The goal is to produce a calibrated confidence q̂i and (pos-
                                                                      sibly new) class prediction ŷi0 based on yi , ŷi , p̂i , and zi .
  1
    This is in contrast with the setting in Section 2, in which the
                                                                         2
model produces both a class prediction and confidence.                       Because the validation dataset is finite, S is as well.
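The histogram binning procedure above, i.e. Eq. (7) with fixed equal-length bins, can be sketched in a few lines of NumPy. Function names and the equal-width binning choice are illustrative, not from the paper:

```python
import numpy as np

def fit_histogram_binning(p_val, y_val, n_bins=15):
    """Fit histogram binning on validation confidences p_val (in [0, 1])
    and binary correctness labels y_val (0/1).

    Returns bin edges a_1..a_{M+1} and per-bin calibrated scores theta_m,
    each the mean of y within its bin (the minimizer of Eq. 7)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)  # equal-length bins
    # np.digitize with right=True assigns p to bin (a_m, a_{m+1}]
    idx = np.clip(np.digitize(p_val, edges, right=True), 1, n_bins)
    theta = np.zeros(n_bins)
    for m in range(1, n_bins + 1):
        mask = idx == m
        if mask.any():  # empty bins keep theta = 0 (any constant is optimal)
            theta[m - 1] = y_val[mask].mean()
    return edges, theta

def apply_histogram_binning(p_test, edges, theta):
    """Replace each test confidence with its bin's calibrated score."""
    n_bins = len(theta)
    idx = np.clip(np.digitize(p_test, edges, right=True), 1, n_bins)
    return theta[idx - 1]
```

Isotonic regression differs only in that the edges and scores are optimized jointly under the monotonicity constraint, rather than the edges being fixed in advance.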
      Dataset                Model      Uncalibrated   Hist. Binning   Isotonic   BBQ     Temp. Scaling   Vector Scaling   Matrix Scaling
       Birds             ResNet 50        9.19%           4.34%        5.22%      4.12%      1.85%             3.0%           21.13%
       Cars              ResNet 50         4.3%           1.74%        4.29%      1.84%      2.35%            2.37%            10.5%
    CIFAR-10            ResNet 110         4.6%           0.58%        0.81%      0.54%      0.83%            0.88%            1.0%
    CIFAR-10          ResNet 110 (SD)     4.12%           0.67%        1.11%       0.9%       0.6%            0.64%            0.72%
    CIFAR-10          Wide ResNet 32      4.52%           0.72%        1.08%      0.74%      0.54%             0.6%            0.72%
    CIFAR-10           DenseNet 40        3.28%           0.44%        0.61%      0.81%      0.33%            0.41%            0.41%
    CIFAR-10              LeNet 5         3.02%           1.56%        1.85%      1.59%      0.93%            1.15%            1.16%
    CIFAR-100           ResNet 110        16.53%          2.66%        4.99%      5.46%      1.26%            1.32%           25.49%
    CIFAR-100         ResNet 110 (SD)     12.67%          2.46%        4.16%      3.58%      0.96%            0.9%            20.09%
    CIFAR-100         Wide ResNet 32       15.0%          3.01%        5.85%      5.77%      2.32%            2.57%           24.44%
    CIFAR-100          DenseNet 40        10.37%          2.68%        4.51%      3.59%      1.18%            1.09%           21.87%
    CIFAR-100             LeNet 5          4.85%          6.48%        2.35%      3.77%      2.02%            2.09%           13.24%
     ImageNet          DenseNet 161       6.28%           4.52%        5.18%      3.51%      1.99%            2.24%               -
     ImageNet           ResNet 152        5.48%           4.36%        4.77%      3.56%      1.86%            2.23%               -
      SVHN            ResNet 152 (SD)     0.44%           0.14%        0.28%      0.22%      0.17%            0.27%            0.17%
     20 News              DAN 3           8.02%           3.6%         5.52%      4.98%      4.11%            4.61%            9.1%
     Reuters              DAN 3           0.85%           1.75%        1.15%      0.97%      0.91%            0.66%           1.58%
   SST Binary           TreeLSTM          6.63%           1.93%        1.65%      2.27%      1.84%            1.84%           1.84%
 SST Fine Grained       TreeLSTM          6.71%           2.09%        1.65%      2.61%      2.56%            2.98%           2.39%

Table 1. ECE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods.
The number following a model’s name denotes the network depth.
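For reference, the ECE numbers reported in Table 1 follow the standard binned estimate: a weighted average, over bins, of the gap between accuracy and mean confidence. A minimal sketch with M equal-width bins, assuming arrays of per-sample confidences and 0/1 correctness indicators:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """ECE = sum over bins of (|B_m| / n) * |acc(B_m) - avg_conf(B_m)|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges, right=True), 1, n_bins)
    n = len(conf)
    ece = 0.0
    for m in range(1, n_bins + 1):
        mask = idx == m
        if mask.any():
            acc = correct[mask].mean()    # accuracy within the bin
            avg_conf = conf[mask].mean()  # average confidence within the bin
            ece += (mask.sum() / n) * abs(acc - avg_conf)
    return ece
```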

Extension of binning methods. One common way of extending binary calibration methods to the multiclass setting is by treating the problem as K one-versus-all problems (Zadrozny & Elkan, 2002). For k = 1, ..., K, we form a binary calibration problem where the label is 1(yi = k) and the predicted probability is σSM(zi)^(k). This gives us K calibration models, each for a particular class. At test time, we obtain an unnormalized probability vector [q̂i^(1), ..., q̂i^(K)], where q̂i^(k) is the calibrated probability for class k. The new class prediction ŷi′ is the argmax of the vector, and the new confidence q̂i′ is the max of the vector normalized by Σ_{k=1}^{K} q̂i^(k). This extension can be applied

T is called the temperature, and it "softens" the softmax (i.e. raises the output entropy) with T > 1. As T → ∞, the probability q̂i approaches 1/K, which represents maximum uncertainty. With T = 1, we recover the original probability p̂i. As T → 0, the probability collapses to a point mass (i.e. q̂i = 1). T is optimized with respect to NLL on the validation set. Because the parameter T does not change the maximum of the softmax function, the class prediction ŷi′ remains unchanged. In other words, temperature scaling does not affect the model's accuracy.

Temperature scaling is commonly used in settings such as
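The temperature-scaling behavior described above (softening for T > 1, unchanged argmax) can be sketched as follows; in practice the scalar T would be fit by minimizing NLL on the validation set, which this sketch omits:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: T > 1 softens, T -> 0 sharpens."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
p_hat = softmax(logits)             # T = 1: original confidences
q_hat = softmax(logits, T=2.5)      # T > 1: softened confidences
assert q_hat.max() < p_hat.max()    # softening lowers the top confidence
assert q_hat.argmax() == p_hat.argmax()  # argmax, hence accuracy, unchanged
```

Because a single scalar divides all logits, the ordering of the classes is preserved for every input, which is why calibration improves while accuracy is untouched.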