
Boldness-Recalibration for Binary Event Predictions

Received 04 May 2023, Accepted 28 Mar 2024, Published online: 13 May 2024

Abstract

Probability predictions are essential to inform decision making across many fields. Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, that is, spread out enough to be informative for decision making. However, there is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, that is, non-bold. The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration that enables practitioners to responsibly embolden predictions subject to their required level of calibration. Specifically, we allow the user to pre-specify their desired posterior probability of calibration, then maximally embolden predictions subject to this constraint. We demonstrate the method with a case study on hockey home team win probabilities and then verify the performance of our procedures via simulation. We find that very slight relaxation of calibration probability (e.g., from 0.99 to 0.95) can often substantially embolden predictions when they are well calibrated and accurate (e.g., widening hockey predictions’ range from 26%–78% to 10%–91%).

1 Introduction

Probability predictions are made for everyday events, from the mundane, like the probability it will rain, to the life-altering, like the probability that a natural disaster hits a particular city. These predictions arise from sophisticated statistical and machine learning techniques, from human judgment and expertise, or both. Regardless, probability predictions are commonly used in important decision making processes in the fields of medicine, economics, image recognition via machine learning, sports analytics, entertainment, and many others, so it is critical that we have methods that assess such predictions. The purpose of this article is to develop boldness-recalibration, which enables forecasters to achieve well calibrated, responsibly bold probability predictions for binary events.

We describe the assessment of probability predictions in terms of three aspects: calibration, accuracy, and boldness. The first aspect is calibration. Predicted probabilities are well calibrated when the events they aim to predict occur with the same probability that was forecasted. For example, if the home team wins 40% of games for which a hockey forecaster predicted a 40% chance of a win, the forecaster is well calibrated. Probability calibration is well studied within the fields of statistics, meteorology, psychology, machine learning, and others (Bross 1953; Murphy and Winkler 1977; Dawid 1982; DeGroot and Fienberg 1983; Gonzalez and Wu 1999; Guo et al. 2017). Naturally, calibration is considered a minimal desirable property of predicted probabilities (Dawid 1982). Without calibration, if a forecaster says the probability of a home team win is 70%, you cannot rely on that prediction to reflect the true probability of a win. However, well calibrated predictions are not necessarily accurate nor bold enough to be useful.

The second aspect to assess probability predictions is classification accuracy. Classification accuracy measures how well predictions distinguish between the events they aim to predict. Receiver-Operating Characteristic (ROC) curves and the corresponding area under the ROC curve (AUC) are frequently used to assess classification accuracy for probability predictions. These accuracy assessments do not measure calibration since any monotone transformation applied to forecaster probability predictions will produce the same ROC curve and AUC as the original predictions.

The third aspect to assess probability predictions is boldness. We define boldness simply as the spread (i.e., standard deviation) in probability predictions. To illustrate, in National Hockey League (NHL) games in the 2020–21 season, the home team won 53% of games; thus, the sample proportion is ȳ = 0.53. A hockey forecaster who simply predicts the base rate of 0.53 for every NHL game is well calibrated, but lacks the boldness needed for actionable predictions. However, forecasters who produce bold predictions alone, without calibration or good classification, show they have misplaced confidence in their prediction ability.

This brings into focus the core tension between calibration and boldness, subject to classification accuracy. Notice the star in Figure 1 at the center of the three aspects. This intersection is where forecaster predictions are accurate, well calibrated, reasonably bold, and thus, actionable for decision making. It is important to note that the level of boldness considered “reasonable” depends directly on classification accuracy and the overall decision making goal. In some cases (such as forecasting rain), maintaining the highest level of calibration achievable may be most important. Increasing boldness here may be useful so that event planners can consider well informed “worst case” weather predictions for a specific day. In other settings, poor classification accuracy (e.g., due to badly informed forecasters) may limit the amount of emboldening that is responsible. Sacrificing a small amount of calibration for greater boldness allows the analyst to responsibly examine riskier predictions in a variety of areas where investments of time, effort, and/or money are called for (e.g., sports betting, medical diagnostics, financial investing, hiring employees, etc.). This article allows the analyst to examine emboldened probability predictions in the context of a user-specified requirement for calibration using Bayesian reasoning.

Fig. 1 Venn Diagram highlighting the possible combinations of three aspects of probability predictions: calibration, boldness, and classification accuracy. We propose a boldness-recalibration approach that enables forecasters to maximize boldness while maintaining a high probability of calibration, subject to their classification accuracy. The star represents probability predictions that are well calibrated, bold, and accurate. Empty set ∅ indicates that forecasts that are not accurate cannot be both well calibrated and bold.

Several techniques exist that focus purely on correcting miscalibration. Dalton (2013) leverages the Cox linear-logistic model to test for calibration and proposes a relative calibration metric, but makes no mention of prediction boldness. Platt (2000) introduces Platt scaling, which recalibrates Support Vector Machine (SVM) output via sigmoid curves. Guo et al. (2017) propose recalibration via temperature scaling, a one-parameter extension of Platt scaling for Neural Network output. Zadrozny and Elkan (2001) propose a nonparametric approach called histogram binning where probabilities are bin-wise recalibrated to minimize squared loss. Zadrozny and Elkan (2002) extend this by fitting a piece-wise recalibration function on each bin interval. Naeini, Cooper, and Hauskrecht (2015) extend this further to Bayesian Binning into Quantiles (BBQ), where multiple binning strategies are considered via Bayesian model averaging. However, none of these methods incorporate boldness in their adjustment.

Some approaches to assessing calibration also consider notions similar to boldness. Reliability diagrams plot the predicted forecaster probabilities versus the observed frequency within each bin (Murphy and Winkler 1977). A calibration metric called Expected Calibration Error (ECE) quantifies the miscalibration seen in reliability diagrams by averaging the distances between the predictions and observed frequencies within each bin. Sometimes histograms (Ranjan and Gneiting 2010; Dimitriadis, Gneiting, and Jordan 2021) or density plots (Satopää 2022) of the predicted probabilities are included with reliability diagrams to visualize boldness. However, boldness is not quantified in this approach.

A common metric for prediction accuracy, the Brier Score, can be decomposed into three parts such that one component measures calibration, another measures resolution, and the last measures the uncertainty in the outcomes themselves (Brier 1950; Murphy 1973). Resolution (or discrimination) refers to how well forecasts distinguish between the two possible outcomes. Predicted probabilities have high resolution when they are further from the base rate. While resolution is similar in concept to boldness, they are not mathematically equivalent. We will show that boldness is measured independently of the event outcomes, whereas measures of resolution rely on the base rate and thus are not fully disentangled from the overall uncertainty of event outcomes.

A few methods both recalibrate and embolden predictions in highly specific circumstances. Lichtendahl et al. (2022) and Satopää (2022) consider the spread of aggregated forecaster predictions, but their forecast aggregation approaches are not applicable to individual forecasters. We focus on appropriately emboldening predictions from a single forecaster subject to their calibration and classification accuracy. The predictions in the case study we present are not aggregate forecasts, and thus the approaches of Lichtendahl et al. (2022) and Satopää (2022) are not applicable here. Turner et al. (2014), Baron et al. (2014), and Atanasov et al. (2017) also focus on aggregates, using the linear log odds (LLO) recalibration function to adjust aggregate boldness. Roitberg et al. (2022) employ a network-based temperature scaling approach to recalibrate and correct overly bold softmax pseudo-probabilities. However, Turner et al. (2014), Baron et al. (2014), Atanasov et al. (2017), and Roitberg et al. (2022) all rely on the Brier Score and/or ECE to assess calibration. Han and Budescu (2022) focus on LLO applied to forecasts of continuous, rather than binary, events. Gonzalez and Wu (1999) use LLO to recalibrate single forecaster predictions but focus solely on the psychological implications of probability perception for binary events. None of these methods provide direct control of the calibration-boldness tradeoff.

To the best of our knowledge, no methodology yet exists that provides a mechanism to directly control the tradeoff between calibration and boldness. To address this gap in the literature, we propose boldness-recalibration. Boldness-recalibration allows users to set the desired level of calibration in terms of the posterior calibration probability and then maximizes boldness by maximizing spread in predictions subject to that calibration level. Three key virtues of this approach are that it (a) quantifies the calibration-boldness tradeoff in an interpretable manner (in a Bayesian sense), (b) is forecaster agnostic, meaning it operates only on probability and event data, not on how the forecaster made the predictions, and (c) does not rely on binning. Additionally, we have developed the R package BRcal in order to make boldness-recalibration readily available for analysts. The package can be found at https://github.com/apguthrie/BRcal with a subsequent CRAN release planned.

The rest of this article is organized as follows. Section 2 introduces boldness-recalibration methodology. A real-world case study involving hockey predictions and a simulation study are presented in Sections 3 and 4, respectively. Section 5 provides a discussion and concluding comments.

2 Methods

The following approaches are forecaster agnostic, meaning they can be applied to any probability forecasts of binary events produced by forecasters from many domains, regardless of how the predictions were made. By forecaster, we mean any entity that produces probability predictions, regardless of whether those predictions are machine learning output and/or a product of human judgment and expertise.

2.1 Linear Log Odds (LLO) Recalibration Function

To assess calibration, we use the linear log odds (LLO) recalibration function. Let c(xi; δ, γ) be the LLO function
(1) $c(x_i;\delta,\gamma)=\frac{\delta x_i^{\gamma}}{\delta x_i^{\gamma}+(1-x_i)^{\gamma}},$
where xi is a probability prediction from a forecaster, δ > 0 and γ ∈ ℝ. We call the outputted probability, c(xi; δ, γ), the LLO-adjusted probability. The LLO-adjusted set is based on shifting and scaling each of the original forecaster probabilities xi on the log odds scale using δ and γ. Thus, on the log odds scale, the LLO-adjusted set is linear with respect to xi according to intercept log(δ) and slope γ, and can be rewritten as
(2) $\log\!\left(\frac{c(x_i;\delta,\gamma)}{1-c(x_i;\delta,\gamma)}\right)=\gamma\log\!\left(\frac{x_i}{1-x_i}\right)+\log(\delta).$

Suitable choices of δ and γ can calibrate poorly calibrated probabilities. The flexibility of the LLO function can capture many forms of miscalibration (Gonzalez and Wu 1999; Turner et al. 2014). When both δ = 1 and γ = 1, the LLO function imposes no shifting nor scaling, returning the original prediction xi (Gonzalez and Wu 1999). Thus, the null values δ0 = γ0 = 1 correspond to the hypothesis that xi is well calibrated. This is similar to how Reliability Diagrams operate: when predicted forecaster probabilities are close to the observed frequency within each bin (i.e., forecasts are well calibrated), the result resembles the x = y line. The same is true when plotting event rates by LLO-adjusted probability forecasts via δ = 1 and γ = 1 under calibration. It is important to note that if c(xi; δ, γ) is considered the LLO-adjusted “event” probability, the corresponding LLO-adjusted “non-event” probability is 1 − c(xi; δ, γ) rather than c(1 − xi; δ, γ), assuming event outcomes are binary. Additionally, note that if the definitions of “event” and “non-event” are reversed (e.g., take event as home team loss instead of home team win), we arrive at fundamentally the same conclusion.
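
The following is a minimal R sketch of the LLO function in (1), written for illustration only; the BRcal package provides its own implementation, whose interface may differ.

    # LLO recalibration function from (1): shift by delta and scale by gamma on the log odds scale
    llo <- function(x, delta, gamma) {
      (delta * x^gamma) / (delta * x^gamma + (1 - x)^gamma)
    }

    # delta = 1, gamma = 1 returns x unchanged; other values shift and scale on the log odds scale
    x <- c(0.26, 0.53, 0.77)
    llo(x, delta = 1, gamma = 1)   # identical to x
    llo(x, delta = 2, gamma = 3)   # shifted by log(2) and scaled by 3 on the log odds scale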

2.2 Likelihood Function

We adopt a Bernoulli likelihood where the events are presumed independent and the probability of each event is governed by LLO-adjusted probabilities. Let y be a vector of n binary outcomes corresponding to the n predictions in x from a single forecaster. Then, we have
(3) $\pi(x,y\mid\delta,\gamma)=\prod_{i=1}^{n}c(x_i;\delta,\gamma)^{y_i}\left[1-c(x_i;\delta,\gamma)\right]^{1-y_i}.$

This likelihood enables calibration maximization via maximum likelihood estimates (MLEs) for δ and γ. The δ̂MLE and γ̂MLE values produce optimally calibrated probabilities, c(xi; δ̂MLE, γ̂MLE). Shifting via δ̂MLE on the log odds scale adjusts the average prediction to match the sample proportion. Scaling by γ̂MLE on the log odds scale spreads out or contracts predictions based on accuracy. This may be a desirable approach when probability calibration is the sole priority. Our approach of adopting a Bernoulli likelihood governed by LLO-adjusted probabilities is equivalent to a specialized logistic regression model. Details can be found in the online supplement.
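
As a rough illustration of how the MLEs could be obtained, the sketch below numerically maximizes the likelihood in (3) using the llo() helper from the previous sketch; the variable names x538 (a vector of predictions) and y (the 0/1 outcomes) are placeholders, and the BRcal package's own routines may differ.

    # Negative log-likelihood from (3); par[1] = log(delta) keeps delta > 0, par[2] = gamma
    neg_loglik <- function(par, x, y) {
      p <- llo(x, delta = exp(par[1]), gamma = par[2])
      -sum(y * log(p) + (1 - y) * log(1 - p))
    }

    # MLEs of (delta, gamma) via general-purpose numerical optimization
    fit <- optim(c(0, 1), neg_loglik, x = x538, y = y)
    delta_mle <- exp(fit$par[1])
    gamma_mle <- fit$par[2]
    x_mle <- llo(x538, delta_mle, gamma_mle)   # MLE-recalibrated predictions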

2.3 Bayesian Assessment of Calibration

Using the likelihood function in the previous section, we take a Bayesian model selection-based approach to calibration assessment. We compare a well calibrated model, Mc (where δ = γ = 1), to an uncalibrated model, Mu (where δ > 0, γ ∈ ℝ). The posterior model probability of Mc given the observed outcomes y serves as our measure of calibration for the testing framework and can be expressed as
(4) $P(M_c\mid y)=\frac{P(y\mid M_c)P(M_c)}{P(y\mid M_c)P(M_c)+P(y\mid M_u)P(M_u)}.$

Here, P(y|Mi) is the integrated likelihood of the observed outcomes y given Mi and P(Mi) is the prior probability of model i, i ∈ {c, u}. The Bayes Factor comparing the uncalibrated model to the calibrated model is defined as
(5) $BF=\frac{P(y\mid M_u)}{P(y\mid M_c)}.$

Inverting (4) gives us
(6) $\frac{1}{P(M_c\mid y)}=\frac{P(y\mid M_c)P(M_c)+P(y\mid M_u)P(M_u)}{P(y\mid M_c)P(M_c)}$
(7) $=1+BF\,\frac{P(M_u)}{P(M_c)}.$

Thus, the expression in (4) can be rewritten as
(8) $P(M_c\mid y)=\frac{1}{1+BF\,\frac{P(M_u)}{P(M_c)}}.$

An essential component of Bayesian model selection is the specification of prior model probabilities. To the best of our knowledge, this is the first attempt to assess calibration probability through Bayesian model selection, and thus best practices for setting P(Mc) and P(Mu) have not yet been established. We set these prior probabilities to 1/2 in subsequent analyses for illustrative purposes only. The BRcal package allows users to set alternate model priors.

Using the likelihood in (3), the integrated likelihoods, P(y|Mi), are not analytically tractable. While a fully Bayesian approach could be implemented, we advocate for a useful approximation. We employ a large sample approximation to the Bayes factor using the Bayesian Information Criteria (BIC) such that
(9) $BF\approx\exp\left\{-\frac{1}{2}\left(BIC_u-BIC_c\right)\right\}$
to form the posterior model probability in (8). See Kass and Raftery (1995) and Kass and Wasserman (1995) for more information about this approximation. Here, the BIC under the well calibrated model Mc is defined as
(10) $BIC_c=-2\log\left(\pi(\delta=1,\gamma=1\mid x,y)\right).$

The penalty term for number of estimated parameters is omitted in (10) as both parameters are fixed at 1 under Mc. The BIC under the poorly calibrated model Mu is defined as
(11) $BIC_u=2\log(n)-2\log\left(\pi(\hat{\delta}_{MLE},\hat{\gamma}_{MLE}\mid x,y)\right).$

With this approximation for BF, we form P(Mc|y), which can be interpreted as the probability the set of forecasts x is well calibrated given the observed data y. Again, P(Mc|y) corresponds to calibration as δ=γ=1 implies events happen at the rate forecasted with no further adjustment. The interpretability of the posterior model probability, P(Mc|y), is the key feature of this Bayesian test for calibration. By quantifying the calibration of probability forecasts with a readily interpretable metric, we enable easier comparison of forecasters in terms of calibration and more informed decision making. We posit that P(Mc|y) is interpretable to the extent that the Bayesian posterior probabilities that condition on observed data are interpretable. For a frequentist approach to assessing calibration using (3), see the Likelihood Ratio test presented in the online supplement.
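
Continuing the sketch above, the posterior probability of calibration under equal prior model probabilities can be approximated directly from (8)-(11); again, this is an illustrative calculation rather than the BRcal implementation.

    # BIC approximation to P(Mc | y) with P(Mc) = P(Mu) = 1/2
    n     <- length(y)
    bic_c <- 2 * neg_loglik(c(0, 1), x538, y)               # eq (10): delta = 1, gamma = 1
    bic_u <- 2 * log(n) + 2 * neg_loglik(fit$par, x538, y)  # eq (11): evaluated at the MLEs
    bf    <- exp(-0.5 * (bic_u - bic_c))                    # eq (9): BF of Mu vs Mc
    post_calib <- 1 / (1 + bf)                              # eq (8) with P(Mu)/P(Mc) = 1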

2.4 Boldness-Recalibration

The previous Bayesian model selection approach assesses calibration alone with no regard for boldness. We now consider the boldness of predicted probabilities measured by their spread (i.e., standard deviation)
(12) $s_b=\operatorname{sd}(x).$

The goal of boldness-recalibration is to maximize sb, the boldness of predictions, subject to a user-specified constraint on the calibration probability, P(Mc|y). To accomplish this, we let the user set the calibration level, t, that P(Mc|y) must achieve. For example, if we want to ensure our recalibrated probabilities have at least a 95% posterior probability of calibration, we would set t = 0.95.

Then boldness (sb) is maximized subject to P(Mc|y) = 0.95. We call xt the (100 × t)% Boldness-Recalibration set, where xi,t = c(xi; δ̂t, γ̂t) and
(13) $(\hat{\delta}_t,\hat{\gamma}_t)=\underset{(\delta,\gamma)}{\arg\max}\left\{s_b : P(M_c\mid y,\delta,\gamma)\geq t\right\}.$

To visualize the process of boldness-recalibration, consider the two schemas in Figure 2. The panel on the left depicts predictions that vary in boldness. The “less bold” predictions are closer to the base rate ȳ. The “more bold” predictions arise by moving the original predictions away from the base rate, and thus, increasing spread.

Fig. 2 Schemas to visualize boldness-recalibration. The left panel shows boldness as a function of spread in predictions. Each line corresponds to a prediction. The right panel shows a boldness-recalibration contour plot where the x-axis is shift parameter δ, y-axis is scale parameter γ, and z-axis is P(Mc|y) achieved by δ and γ. Contours correspond to P(Mc|y) = 0.95 (solid red), 0.9 and 0.8 (dashed black). The × corresponds to (δ̂MLE, γ̂MLE) such that the resulting probabilities under LLO-adjustment have maximal probability of calibration. The star on the 0.95 contour corresponds to (δ̂0.95, γ̂0.95) such that the resulting probabilities have maximal spread subject to 95% calibration. These LLO-adjusted probabilities are called the 95% boldness-recalibration set.

The panel on the right of Figure 2 shows a boldness-recalibration contour plot. This plot is used in the case study to show P(Mc|y) across a grid of LLO-adjustment parameters δ and γ. Rather than focus solely on where P(Mc|y) is high (i.e., high calibration), we can draw a contour at P(Mc|y) = t to focus on our user-specified level of calibration. Then, along that contour we identify the δ and γ that maximize spread in the LLO-adjusted probabilities via a grid-search based approach. The δ and γ values corresponding to the star indicate precisely how to use (1) to embolden predictions subject to t. In Figure 2, we identify these parameters with the star along the red contour at t = 0.95. We call the LLO-adjusted probabilities under these parameters the 95% boldness-recalibration set.
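
One simple way to carry out such a grid search, consistent with the description above but not the BRcal optimizer itself, is sketched below; it reuses llo(), neg_loglik(), fit, and bic_u from the earlier sketches, and the grid ranges are arbitrary choices for illustration. Because LLO adjustments compose (an LLO adjustment of an LLO-adjusted set is again an LLO adjustment of the original set), only the calibrated-model BIC changes across the grid.

    # Naive grid search for (13): among (delta, gamma) pairs whose posterior probability of
    # calibration is at least t, keep the pair maximizing the sd of the LLO-adjusted predictions
    t_level <- 0.95
    grid <- expand.grid(delta = exp(seq(-2, 2, length.out = 200)),
                        gamma = seq(0.1, 4, length.out = 200))
    best <- c(sb = -Inf, delta = NA, gamma = NA)
    for (j in seq_len(nrow(grid))) {
      d <- grid$delta[j]; g <- grid$gamma[j]
      bic_c_j <- 2 * neg_loglik(c(log(d), g), x538, y)   # adjusted set evaluated at (1, 1)
      p_j <- 1 / (1 + exp(-0.5 * (bic_u - bic_c_j)))     # posterior calibration probability
      if (p_j >= t_level) {
        sb_j <- sd(llo(x538, d, g))
        if (sb_j > best["sb"]) best <- c(sb = sb_j, delta = d, gamma = g)
      }
    }
    x_br95 <- llo(x538, best["delta"], best["gamma"])    # 95% boldness-recalibration set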

2.5 Other Methods to Assess Calibration

We report the Brier Score and Expected Calibration Error for the examples in this article.

2.5.1 Brier Score Calibration Component

For binary events, the Brier Score takes on the form
(14) $BS=\frac{1}{n}\sum_{i=1}^{n}(x_i-y_i)^2$
where xi is the predicted probability for event i and yi is the binary outcome (0 if a non-event, 1 if an event). The Brier Score in (14) can take on any value from 0 to 1, where lower values are better.

The Brier Score can be decomposed as follows:
(15) $BS=\frac{1}{n}\sum_{k=1}^{K}n_k(\bar{x}_k-\bar{y}_k)^2-\frac{1}{n}\sum_{k=1}^{K}n_k(\bar{y}_k-\bar{y})^2+\bar{y}(1-\bar{y})$
where the vector of probabilities x is binned into K bins, x̄k is the average prediction for bin k, ȳk is the relative frequency of events corresponding to the observations in bin k, ȳ is the overall base rate, nk is the number of observations within bin k, and n is the total number of predictions (Murphy 1973). The first addend on the right hand side of (15) is a measure of calibration, which we will refer to as Brier Score Calibration (BSC), and is the measure we will compare to P(Mc|y). The second addend on the right hand side of (15) is a measure of resolution, which we will abbreviate as BSR, and is a measure we will compare to sb. The third addend is a measure of uncertainty in the outcomes, which we will abbreviate as BSU. Lower values of BSC are better, with BSC = 0 indicating perfect calibration. Higher values of BSR are better, with BSR = BSU indicating perfect resolution.
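
A short R sketch of the decomposition in (15) is given below, using equal-width bins (10 by default here, matching the binning used later in the case study); it is written for illustration and is not the exact code used for the reported results.

    # Brier Score and its decomposition (15) with K equal-width bins on [0, 1]
    brier_decomp <- function(x, y, K = 10) {
      bin  <- cut(x, breaks = seq(0, 1, length.out = K + 1), include.lowest = TRUE)
      n    <- length(y)
      nk   <- tapply(y, bin, length)
      xbar <- tapply(x, bin, mean)
      ybar <- tapply(y, bin, mean)
      keep <- !is.na(nk)                                        # drop empty bins
      bsc  <- sum(nk[keep] * (xbar[keep] - ybar[keep])^2) / n   # calibration (reliability)
      bsr  <- sum(nk[keep] * (ybar[keep] - mean(y))^2) / n      # resolution
      bsu  <- mean(y) * (1 - mean(y))                           # outcome uncertainty
      c(BS = mean((x - y)^2), BSC = bsc, BSR = bsr, BSU = bsu)
    }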

2.5.2 Expected Calibration Error (ECE)

For binary events, Expected Calibration Error (ECE) takes on the form
(16) $ECE=\sum_{k=1}^{K}\frac{n_k}{n}\left|\bar{y}_k-\bar{x}_k\right|$
where x is binned into K bins, nk is the number of predictions in bin k, ȳk is the proportion of observed events in bin k, and x̄k is the average probability prediction in bin k. ECE can take on any value from 0 to 1, where lower values are better.
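
The same binning yields ECE as in (16); a brief sketch for completeness, again an illustration rather than the exact code used for the reported results.

    # Expected Calibration Error (16) with K equal-width bins on [0, 1]
    ece <- function(x, y, K = 10) {
      bin  <- cut(x, breaks = seq(0, 1, length.out = K + 1), include.lowest = TRUE)
      nk   <- tapply(y, bin, length)
      ybar <- tapply(y, bin, mean)
      xbar <- tapply(x, bin, mean)
      keep <- !is.na(nk)
      sum(nk[keep] / length(y) * abs(ybar[keep] - xbar[keep]))
    }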

3 Hockey Home Team Win Predictions Case Study

To demonstrate the capabilities of boldness-recalibration, we present the following case study on hockey home team win predictions.

3.1 Data

We assembled data from FiveThirtyEight that pertain to the 2020–21 National Hockey League (NHL) Season. FiveThirtyEight produced predicted probabilities for all 868 regular season games that season via modeling with carefully constructed components based on expert knowledge of the game of hockey. These predictions were furnished prospectively pre-game, with no in- or post-game updating. FiveThirtyEight probabilities are potentially hedged toward the base rate of 0.53 with an inter-quartile range of 0.12 (0.47, 0.59), their full range being (0.26, 0.77). More detailed information about this dataset can be found in the online supplement.

In addition to this real-world forecaster, we generated a set of 868 random probability predictions to represent a hockey forecaster who is completely uninformed about the NHL games they aimed to predict. We call this forecaster our “random noise forecaster.” To mimic this behavior and better enable comparability, our random noise forecaster is generated by taking random uniform draws from 0.26 to 0.77, the observed range in the FiveThirtyEight data. The purpose of the random noise forecaster is to demonstrate how boldness recalibration operates when predictions are unrelated to the events they predict. We want to ensure our method does not blindly embolden inaccurate forecasts.
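
For concreteness, the random noise forecaster can be generated with a single line of R; the seed below is an arbitrary choice for illustration.

    set.seed(1)                                    # arbitrary seed, for illustration only
    x_noise <- runif(868, min = 0.26, max = 0.77)  # uninformed predictions on FiveThirtyEight's range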

3.2 Results

We applied boldness-recalibration to the two hockey forecasters at three specified levels of calibration: t = 0.95, 0.9, and 0.8. Figure 3 shows the boldness-recalibration plots for FiveThirtyEight (Left) and the random noise forecaster (Right). Regions in red show where P(Mc|y) is high for the LLO-adjusted x via the corresponding δ (x-axis) and γ (y-axis) values. Regions in blue show where P(Mc|y) is low. As expected, δ̂MLE and γ̂MLE, marked by the white × in Figure 3, lie at the point where the probability of calibration is maximized. The values for δ̂t and γ̂t are marked by white points along the contour for each t. Recall these represent the set of LLO-adjustment parameters for which maximal boldness is achieved with a probability of calibration of at least t. These parameter values, along with the achieved P(Mc|y), sb, prediction range, Brier Score, BSC, BSR, ECE, and AUC are summarized in Table 1. Note that for BSC, BSR, and ECE the probabilities x are binned into 10 equal-width bins across the range [0,1].

Fig. 3 Boldness-recalibration contour plots for FiveThirtyEight (Left) and random noise forecaster (Right). Regions in red reflect high P(Mc|y) for the LLO-adjusted x via corresponding δ (x-axis) and γ (y-axis) values. Regions in blue show low P(Mc|y). The × marks δ̂MLE and γ̂MLE where the probability of calibration is maximized. Contours at t=0.95,0.9, and 0.8 are drawn in white and δ̂t and γ̂t are marked by white points along each contour.

Table 1 Values of the posterior model probability of calibration P(Mc|y), boldness (sb), prediction range, Brier Score (BS), Brier Score calibration component (BSC), Brier Score resolution component (BSR), expected calibration error (ECE), and area under the ROC curve (AUC) for the original sets of predictions and those achieved under MLE recalibration, 95%, 90%, and 80% boldness-recalibration (B-R) via estimated adjustment parameters δ̂ and γ̂ for FiveThirtyEight and random noise forecaster.

After deploying boldness-recalibration on the two sets of predictions, we see a substantial increase in boldness for FiveThirtyEight. Figure 4 shows how the predictions for FiveThirtyEight (Left) and the random noise forecaster (Right) change under LLO-adjustments via MLEs and boldness-recalibration. The first column of points in each panel represents the original set of probability predictions. The second column of points represents the predictions after recalibrating with the MLEs. The third, fourth, and fifth columns of points represent the predictions after 95%, 90%, and 80% boldness-recalibration, respectively. A line is used to connect each original prediction to where it ends up after each recalibration procedure. Points and lines colored blue correspond to predictions for games in which the home team won. Red corresponds to games in which the home team lost. The posterior model probability of calibration is reported in the parentheses in the axis label. Note that the posterior model probabilities are not necessarily linear from left to right. We order the sets in this way, starting with the original forecasters' sets, to make consistent comparisons throughout the results of the article.

Fig. 4 Lineplot visualizing how the predictions for FiveThirtyEight (Left) and the random noise forecaster (Right) change under LLO-adjustments via MLEs and boldness recalibration. The first column of points in each panel is the original set of probability predictions. The second column of points is the predictions after recalibrating with the MLEs. The last three columns are the predictions after 95%, 90%, and 80% boldness-recalibration, respectively. A line is used to connect each original prediction to where it ends up after each recalibration procedure. Points and lines colored blue correspond to predictions for games in which the home team won. Red corresponds to games in which the home team lost. Achieved P(Mc|y) is reported in the parentheses in the axis label.

First, notice that the probability of calibration given the event outcomes for the original FiveThirtyEight forecasts is very high at 0.9904, whereas the probability for the random noise forecaster rounds down to 0.000. This indicates that FiveThirtyEight is well calibrated to begin with and, as we would expect, the random noise forecaster is not. Next, notice that by maximizing P(Mc|y), the range of FiveThirtyEight's predictions expands from (0.26, 0.77) to (0.18, 0.84) and sb increases from 0.091 to 0.124, as seen in Table 1. FiveThirtyEight can achieve a maximal probability of calibration of 0.9988. In contrast, for the random noise forecaster to achieve their maximal calibration of 0.9988, they must pull their predictions in toward the base rate of 0.53. Their prediction range contracts from (0.26, 0.77) to (0.51, 0.56) and sb drops from 0.146 to 0.011. Not only does this imply the random noise forecaster is poorly calibrated, but it also suggests that their predictions do not have useful predictive information. We know this to be true because these predictions were randomly generated with no association with the outcome.

Now compare the spread of original predictions to the spread of the 95% boldness-recalibration set. FiveThirtyEight can further embolden their predictions by accepting a 5% risk of miscalibration, expanding their range to (0.10, 0.90). This suggests that FiveThirtyEight could embolden predictions with a modest decrease in P(Mc|y), whereas the random noise forecaster has no knowledge of the outcome and should make far more cautious calls. In this example, there is minimal gain in boldness moving from 95% B-R to 90% or 80%. It is up to the discretion of the user to determine whether accepting an additional 5% or 10% risk of miscalibration is worth the minimal gain in boldness. Regardless, boldness-recalibration successfully increases the boldness of our skilled hockey forecaster while maintaining a user-specified level of calibration. For our random noise forecaster, boldness-recalibration suggests that increasing boldness would not be responsible, and instead contracts predictions.

In terms of the Brier Score, FiveThirtyEight and the random noise forecaster achieve scores of 0.2346 and 0.2675, respectively. It is hard to say how this practically translates to how much “better” FiveThirtyEight is compared to the random noise forecaster and what a “good” Brier Score is for this application. Despite the substantial increase in sb and prediction range for FiveThirtyEight, the Brier Score shows very little change, improving by 0.001 under MLE recalibration and an additional 0.002 under 95% B-R. The BSC is the same for the original and 95% B-R sets and BSR improves by 0.003. In contrast, BSC for the random noise forecaster drops to near zero after MLE recalibration and then worsens to 0.002 under 95% B-R. The BSR worsens after MLE recalibration and remains at 0.000, the worst achievable score for BSR, for all B-R sets. This further reflects that the random noise forecaster can only improve calibration by reducing boldness and resolution. We see that ECE is minimized when calibration is maximized for both forecasters and worsens by 0.017 under 95% B-R for FiveThirtyEight and by 0.028 for the random noise forecaster.

In terms of classification accuracy, FiveThirtyEight produces an AUC of 0.65. This implies their predictions are better than chance and provide some information in classifying a home team win. Our random noise forecaster produces an AUC of 0.51, which is very close to the underlying 0.5 we would expect, as this forecaster makes predictions completely via random chance. Notice that AUC stays the same across all sets for both forecasters. Because the LLO function is monotonic, the ordering of predictions does not change, and neither does AUC.
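
This invariance is easy to verify with a rank-based (Mann-Whitney) calculation of AUC, which depends only on the ordering of the predictions; the sketch below uses the placeholder objects from the earlier sketches.

    # Rank-based AUC: identical for x and any increasing monotone transform of x
    auc <- function(x, y) {
      r  <- rank(x)
      n1 <- sum(y == 1); n0 <- sum(y == 0)
      (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }
    auc(x538, y)             # original predictions
    auc(llo(x538, 2, 3), y)  # LLO-adjusted (gamma > 0 is increasing), so AUC is unchanged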

4 Simulation Study

To further assess the properties of boldness-recalibration, we present the following simulation study.

4.1 Data

Table 2 shows the four forecaster types in our simulation study. The data generation process for our simulation is as follows:

Table 2 Values of LLO parameters (δ, γ) and Prelec parameters (α, β) under which each forecaster type is simulated along with description of forecaster type.

  1. Generate n true event outcomes via random independent Bernoulli draws, where the probability of success at each draw takes on a random uniform draw from 0 to 1.

    $p_i \sim \mathrm{Uniform}(0,1), \qquad y_i \sim \mathrm{Bernoulli}(p_i), \qquad i=1,\ldots,n$

    The pi's make up the well calibrated forecaster predictions by construction, as they directly correspond to the true probability of each event outcome.

  2. To manipulate classification accuracy, add varying amounts of random noise, vi, to each pi on the log odds scale, which is equivalent to

    $p_{i,\sigma}=\frac{e^{v_i}p_i}{(1-p_i)+e^{v_i}p_i}$

    where pi,σ is the set of noisy probabilities and vi ∼ N(0, σ²), σ ∈ {0, 0.1, 0.5, 1, 2}.

  3. To manipulate boldness and create the four forecaster types, LLO-adjust pi,σ under varying δ and γ values, summarized in Table 2 (a small sketch of this generation process is given after this list). Since the LLO function is monotone, forecasters LLO-adjusted from pi,σ maintain the same classification accuracy as pi,σ.
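
A minimal sketch of one pass of this generation process is given below; the δ and γ values here are placeholders chosen to mimic a Hedger-style forecaster, not the actual Table 2 settings.

    # One simulated forecaster: n outcomes, noisy probabilities, then an LLO-adjusted set
    n     <- 800
    sigma <- 0.5
    p     <- runif(n)                            # Step 1: true event probabilities
    yy    <- rbinom(n, size = 1, prob = p)       #         and event outcomes
    v     <- rnorm(n, mean = 0, sd = sigma)      # Step 2: noise on the log odds scale
    p_sig <- exp(v) * p / ((1 - p) + exp(v) * p)
    x_sim <- llo(p_sig, delta = 1, gamma = 0.3)  # Step 3: placeholder Hedger-style (delta, gamma)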

The first forecaster type, called Well Calibrated, represents forecasters whose predictions correspond to the true event rate. Notice in Table 2 that these predictions are LLO-adjusted under δ = γ = 1, so the Well Calibrated probabilities are equivalent to the perfectly calibrated probabilities with added noise, that is, p^{wc}_{i,σ} = p_{i,σ}. Thus, under σ = 0, p^{wc}_{i,0} = p_i, the perfectly calibrated probabilities.

Our second forecaster type is called Hedger. The Hedger compresses probabilities around the base rate, 0.5 in this case. We call their predictions “hedged” as they reflect forecasters who are lacking boldness even though their accuracy could be high. In contrast, our third forecaster type, Boaster, represents a forecaster who exhibits excessive boldness. The majority of their predictions are far from the base rate and very close to the extremes of 0 and 1. The fourth forecaster type is Biased. These forecasters systematically make predictions that are higher or lower than the event rate.

While this simulation focuses on miscalibration simulated via LLO-adjustment, we also explore miscalibration simulated from Prelec's two-parameter function:
(17) $w(x;\alpha,\beta)=e^{-\beta(-\log(x))^{\alpha}}$
where α > 0 and β > 0 (Prelec 1998). We follow the same simulation procedure as shown above, except using (17) rather than LLO in Step 3. Similar to LLO-adjustment, the Well Calibrated forecaster is Prelec-adjusted under α = β = 1. The other forecaster parameter values for α and β are summarized in Table 2. Note that although the original formulation of this function in Prelec (1998) limits 0 < α < 1, we allow α ≥ 1 as the function provides valid probabilities under these settings. Given that our methodology assumes miscalibration follows the LLO function, which likely will not hold in all scenarios, we simulate miscalibration via the Prelec function to assess how well our methodology does under miscalibration misspecification.
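
A one-line implementation of (17) can stand in for llo() in Step 3; under α = β = 1 it returns the input probabilities unchanged.

    # Prelec's two-parameter probability weighting function (17)
    prelec <- function(x, alpha, beta) {
      exp(-beta * (-log(x))^alpha)
    }
    prelec(0.5, alpha = 1, beta = 1)   # returns 0.5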

To explore the effect of sample size on our methodology, we generated datasets of size n = 30, 100, 800, 2000, and 5000. In total, we present results from 100 Monte Carlo (MC) replicates for each value of n. Throughout the study, one MC replicate consists of a set of n outcomes and a corresponding set of n predicted probabilities for each of the four forecaster types under each of the five noise settings and both LLO and Prelec adjustment (35 total predicted probability sets per replicate). For each of the 35 sets, we (i) LLO-adjust via the MLEs δ̂MLE, γ̂MLE, (ii) 95%, 90%, and 80% boldness-recalibrate, and then (iii) evaluate the calibration and boldness of the adjusted sets from (i) and (ii).

4.2 Simulation Results

The posterior model probability of calibration, P(Mc|y), prior to MLE recalibration and boldness-recalibration is summarized in Figure 5 for all 100 MC runs. The boxplots are grouped by simulated forecaster type, shown on the x-axis. The y-axis shows the value of P(Mc|y). Sample size increases with vertical panels from top to bottom. Horizontal panels indicate whether miscalibration was simulated under the LLO or the Prelec function. As the Well Calibrated forecasters do not change under either LLO or Prelec adjustment, they are separated out into their own panel. Within each group of boxplots, added noise σ increases from left to right. Thus, for the Well Calibrated group, only the first boxplot with no added noise (σ = 0) is perfectly calibrated, and both calibration and accuracy decrease as noise increases. Boxplots for sb, BS, BSR, BSC, and ECE can be found in the online supplement.

Fig. 5 Boxplots summarizing the posterior model probability of calibration, P(Mc|y), on the y-axis for 100 MC runs on simulated forecasters. Boxplots are grouped by forecaster type on the x-axis. Within groups, added noise increases from left to right. Only the leftmost boxplot in the Well Calibrated group is perfectly calibrated, and ← indicates calibration increases as noise decreases. Horizontal panels indicate which adjustment (if any) was applied to create the forecaster type. As sample size increases with vertical panels, P(Mc|y) decreases for all forecasters except Well Calibrated with little to no added noise.

We expect P(Mc|y) to be high for well calibrated forecasters and low for poorly calibrated forecasters. As sample size increases, P(Mc|y) decreases for all settings except the Well Calibrated forecasters with little to no added noise. This indicates our Bayesian approach performs sensibly in that the ability to correctly detect miscalibration increases with sample size. Additionally, as more noise is added to the Well Calibrated forecaster, their probability of calibration decreases, as expected. Notice that under low sample sizes, Hedgers with large added noise appear well calibrated. This is not surprising as we have already established that hedging predictions with poor classification accuracy is a favorable strategy to achieve calibration. Also notice that under miscalibration misspecification (i.e., we assume miscalibration follows LLO when it actually follows Prelec), we see similar, if not improved, detection of miscalibration.

All simulated prediction sets were MLE recalibrated and 95%, 90%, and 80% boldness-recalibrated. Of the 17,500 total prediction sets, 95%, 90%, and 80% boldness-recalibration was successful in 99.4%, 99.2%, and 98.7% of cases, respectively. By “successful”, we mean that these sets were maximally emboldened while calibration of t = 0.95, 0.9, or 0.8 was maintained. In most of the small percentage of cases where boldness-recalibration was not successful, the underlying optimization was unable to converge to parameters that achieved the desired level of calibration. In the other few cases, adding random noise to the probabilities caused perfect separation of events and non-events. Under MLE recalibration, these predictions are all moved to either 0 or 1, where no further emboldening is possible. The sets where boldness-recalibration was not achievable were removed from the results, as our focus is demonstrating the capabilities of boldness-recalibration.

Figure 6 summarizes the change in sb, BSR, BSC, and ECE moving from the MLE recalibrated set to the 95% B-R set under LLO miscalibration. These lineplots differ from the lineplots in the hockey example in that the y-axis shows the value of the metric of interest. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. It is important to note the y-axis is not fixed across panel rows. The first column of points in each panel represents the value of each metric for the MLE recalibration set. The second column of points represents the same metric for the 95% B-R set. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. We choose to only show the metric values under MLE recalibration and 95% boldness-recalibration for two reasons: (i) it is best practice to not operate under poor calibration, whether it be the original set or sets at low boldness-recalibration thresholds like 80%, and (ii) we see little change in these metrics moving from the 95% to the 90% B-R set. Additionally, we found that the results for the Prelec function were nearly identical to those for LLO miscalibration, so we focus on LLO here. Results for all sets can be found in the online supplement.

Fig. 6 Lineplots summarizing on the y-axis the change in (a) boldness measured by sb, (b) Brier Score Resolution, (c) Brier Score Calibration, and (d) Expected Calibration Error for 100 MC runs on LLO-miscalibrated simulated forecasters. Sample size increases with vertical panels. Horizontal panels denote the forecaster type. The first column of points in each panel represents the value of each metric for the MLE recalibration set. The second column of points represents the same metric for the 95% B-R set. A line is used to connect points that correspond to the same original set of predictions. The lines and points are colored based on the amount of added noise. Note that the y-axis is not fixed across panel rows, the points/lines are plotted in a randomized order, and one set is removed from each panel due to random perfect separation issues as described in the text.

Notice in Figure 6(a) that when moving from maximal calibration to 95% calibration, all sets are emboldened to some degree. However, as added noise increases, boldness decreases. This is desirable as less noisy predictions should be more bold than more noisy predictions. The distinction in sb between levels of added noise becomes more clear with higher sample sizes. While the increase in sb from the MLE recalibrated set to the 95% B-R set diminishes with sample size, we expect that the degree to which emboldening is appropriate also diminishes. Where data are abundant and prove to be extremely reliable, there may be less need for emboldening, as the predictions are already useful for decision making.

As for the Brier Score Resolution, notice in Figure 6(b) that there is little change between MLE recalibration and 95% B-R under small sample sizes. In just a few cases under n = 30, we see that BSR decreases after emboldening. There is virtually no change in BSR under large sample sizes. As expected, with more added noise, BSR is typically lower.

Brier Score Calibration is much less consistent across MC runs than the other metrics. Notice in Figure 6(c) that there is little to no distinction in BSC between levels of added noise. Additionally, at small sample sizes BSC sometimes improves but other times worsens. Under large sample sizes, BSC always worsens, as we would expect given that we are sacrificing calibration in favor of boldness. Similarly, in Figure 6(d) we see that there is little distinction in expected calibration error between levels of added noise. However, we see more consistency in terms of the degree of increase in ECE across MC runs. Under small sample sizes, we see little to no change in ECE. As sample size increases, we see larger increases in ECE moving from MLE recalibration to 95% B-R.

5 Conclusion

This article develops boldness-recalibration methodology surrounding the fundamental tension between calibration and boldness of predicted probabilities, subject to classification accuracy. While some methods consider concepts similar to boldness, such as resolution, relative to calibration, none provide direct control of the tradeoff between the two. Our proposed Bayesian calibration assessment and boldness-recalibration approaches address this gap. This article is for those who would consider making a small sacrifice in posterior probability of calibration to gain boldness so as to study riskier predictions for decision making.

The backbone of these approaches is the interpretable (in a Bayesian sense) posterior model probability, P(Mc|y), which serves as a measure of calibration and is interpreted as the probability a set of predictions is calibrated, given the data observed. We define boldness as the spread (i.e., standard deviation) in predictions. In boldness-recalibration, the user pre-specifies a tolerable risk of miscalibration (e.g., P(Mc|y) = 0.95) and subject to this constraint, our method maximizes spread in predictions and thus, boldness. The difference in the posterior model probabilities for the original and boldness-recalibrated sets concisely quantifies the calibration-boldness tradeoff. By pre-specifying calibration via P(Mc|y), the user is given direct control of the boldness calibration tradeoff.

Boldness-recalibration provides a means of appropriately emboldening probability predictions. The hockey case study shows that “appropriate” may have a different meaning depending on the quality of the data. The predictions from FiveThirtyEight were substantially emboldened while maintaining reasonable calibration. This indicates their predictions are reliable but overly cautious. However, boldness-recalibration showed that the random noise forecaster should bring their predictions in toward the base rate rather than embolden. In this case, it is more appropriate to un-embolden as their predictions were not reliable or useful for decision making. “Appropriate” may also have a different meaning depending on the context of the data. While we commonly used a calibration level of 95% (i.e., a 5% risk of miscalibration) to enable emboldening, we cannot recommend this as a general rule across applications. It is up to the discretion of the user to carefully consider what level of miscalibration is tolerable in their application area.

We demonstrate via simulation study that P(Mc|y) correctly identifies miscalibration and appropriately emboldens nearly all forecaster types at a reasonable sample size, even under miscalibration misspecification. After correcting miscalibration in the simulated predictions, we see an increase in boldness across all sets. In cases where the original predictions are noisy, spread is lower in the boldness-recalibrated set than in the original, and higher for those that are accurate.

While we leverage spread to measure boldness and the LLO function to recalibrate, one could consider alternatives to these choices. The core idea of selecting an emboldening plan that satisfies a required probability of calibration still holds. Another potential future research goal of interest is the subjective elicitation of prior probabilities of calibration. While we use P(Mc) = P(Mu) = 1/2, this may not be ideal in situations where prior information can be obtained. Additionally, another goal is to investigate potential dependence structure among forecasts, as this methodology does not currently enable analysis of dependence between forecasts in a set. While we provide one example of a use-case in this article, we propose that these methods are useful in many situations where there are predicted probabilities of binary events. These methods allow decision makers to rely on these predictions to make informed decisions. Appropriately emboldened predictions, as produced by boldness-recalibration in an interpretable manner, enable better decision making in these critical situations.

Supplementary Materials

The supplemental material available online for this article consists of (1) a supplemental document describing additional methods and results, (2) an R script fully reproducing the hockey home team win predictions case study in Section 3 using the BRcal R package, and (3) a bundled version of the BRcal R package at the time of article acceptance. For the current version of the BRcal package, see https://github.com/apguthrie/BRcal.


Acknowledgments

The authors are thankful to Leidos for providing the funding for this work. The authors would like to thank Chris Wilson, Damon Kuehl, Matthew Keefe, Andrew McCoy, Tyler Cody, Xin Xing, and Bill Woodall for their insights and roles obtaining case study data for this line of research. The authors are grateful for the helpful comments from the associate editor and two referees.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The data that support the findings of this study are openly available as supplemental materials accompanying this manuscript and included with the BRcal package in R.

References

  • Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., and Mellers, B. (2017), “Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls,” Management Science, 63, 691–706. DOI: 10.1287/mnsc.2015.2374.
  • Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., and Ungar, L. H. (2014), “Two Reasons to Make Aggregated Probability Forecasts More Extreme,” Decision Analysis, 11, 133–145. DOI: 10.1287/deca.2014.0293.
  • Brier, G. W. (1950), “Verification of Forecasts Expressed in Terms of Probability,” Monthly Weather Review, 78, 1–3. DOI: 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
  • Bross, I. D. J. (1953), Design for Decision, New York: Macmillan.
  • Dalton, J. E. (2013), “Flexible Recalibration of Binary Clinical Prediction Models,” Statistics in Medicine, 32, 282–289. DOI: 10.1002/sim.5544.
  • Dawid, A. P. (1982), “The Well-Calibrated Bayesian,” Journal of the American Statistical Association, 77, 605–610. DOI: 10.1080/01621459.1982.10477856.
  • DeGroot, M. H., and Fienberg, S. E. (1983), “The Comparison and Evaluation of Forecasters,” Journal of the Royal Statistical Society, Series D, 32, 12–22.
  • Dimitriadis, T., Gneiting, T., and Jordan, A. I. (2021), “Stable Reliability Diagrams for Probabilistic Classifiers,” Proceedings of the National Academy of Sciences, 118, e2016191118. https://www.pnas.org/doi/abs/10.1073/pnas.2016191118 DOI: 10.1073/pnas.2016191118.
  • Gonzalez, R., and Wu, G. (1999), “On the Shape of the Probability Weighting Function,” Cognitive Psychology, 38, 129–166. DOI: 10.1006/cogp.1998.0710.
  • Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017), “On Calibration of Modern Neural Networks,” CoRR abs/1706.04599. http://arxiv.org/abs/1706.04599
  • Han, Y., and Budescu, D. V. (2022), “Recalibrating Probabilistic Forecasts to Improve their Accuracy,” Judgment and Decision Making, 17, 91–123.
  • Kass, R. E., and Raftery, A. E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795. DOI: 10.1080/01621459.1995.10476572.
  • Kass, R. E., and Wasserman, L. (1995), “A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion,” Journal of the American Statistical Association, 90, 928–934. https://www.tandfonline.com/doi/abs/10.1080/01621459.1995.10476592 DOI: 10.1080/01621459.1995.10476592.
  • Lichtendahl, K. C., Grushka-Cockayne, Y., Jose, V. R. R., and Winkler, R. L. (2022), “Extremizing and Anti-Extremizing in Bayesian Ensembles of Binary-Event Forecasts,” Operations Research, 70, 2998–3014. DOI: 10.1287/opre.2021.2176.
  • Murphy, A. H. (1973), “A New Vector Partition of the Probability Score,” Journal of Applied Meteorology, 12, 595–600. DOI: 10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  • Murphy, A. H., and Winkler, R. L. (1977), “Reliability of Subjective Probability Forecasts of Precipitation and Temperature,” Journal of the Royal Statistical Society, Series C, 26, 41–47.
  • Naeini, M. P., Cooper, G., and Hauskrecht, M. (2015), “Obtaining Well Calibrated Probabilities Using Bayesian Binning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 29. DOI: 10.1609/aaai.v29i1.9602.
  • Platt, J. (2000), “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods,” Advances in Large-Margin Classifiers, 10, 61–74.
  • Prelec, D. (1998), “The Probability Weighting Function,” Econometrica, 66, 497–528. DOI: 10.2307/2998573.
  • Ranjan, R., and Gneiting, T. (2010), “Combining Probability Forecasts,” Journal of the Royal Statistical Society, Series B, 72, 71–91. DOI: 10.1111/j.1467-9868.2009.00726.x.
  • Roitberg, A., Peng, K., Schneider, D., Yang, K., Koulakis, M., Martínez, M., and Stiefelhagen, R. (2022), “Is My Driver Observation Model Overconfident? Input-Guided Calibration Networks for Reliable and Interpretable Confidence Estimates,” ArXiv abs/2204.04674.
  • Satopää, V. A. (2022), “Regularized Aggregation of One-Off Probability Predictions,” Operations Research, 70, 3558–3580. DOI: 10.1287/opre.2021.2224.
  • Turner, B., Steyvers, M., Merkle, E., Budescu, D., and Wallsten, T. (2014), “Forecast Aggregation via Recalibration,” Machine Learning, 95, 261–289. DOI: 10.1007/s10994-013-5401-4.
  • Zadrozny, B., and Elkan, C. P. (2001), “Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers,” in ICML.
  • ———(2002), “Transforming Classifier Scores into Accurate Multiclass Probability Estimates,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.