Comparing Predictive Model Performance with Confidence Curves (2022-US-45MP-1141)

Bryan Fricke, JMP Principal Software Developer, SAS
Russ Wolfinger, Director, Research and Development, JMP

 

Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine if the differences between predictive models are “significant”, although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP Add-In that operates directly on the results generated from JMP Pro's Model Screening platform.

 

 

Hello. My name is Bryan Fricke. I'm a product manager at JMP focused on the JMP user experience. Previously, I was a software developer working on exporting reports to standalone HTML files, JMP Live, and JMP Public.

In this presentation, I'm going to talk about using Confidence Curves as an alternative to null hypothesis significance testing in the context of predictive model screening. Additional material on this subject can be found on the JMP Community website in the paper associated with this presentation. Dr. Russ Wolfinger is a Distinguished Research Fellow at JMP and a co-author, and I would like to thank him for his contributions.

The Model Screening platform, introduced in JMP Pro 16, allows you to evaluate the performance of multiple predictive models using cross-validation. To show you how the Model Screening platform works, I'm going to use the Diabetes data table, which is available in the JMP sample data library.

I'll choose Model Screening from the Analyze > Predictive Modeling menu. JMP responds by displaying the Model Screening dialog. The first three columns in the data table represent disease progression in continuous, binary, and ordinal forms. I'll use the continuous column named Y as the response variable. I'll use the columns from Age to Glucose in the X, Factor role. I'll type 1234 in the Set Random Seed input box for reproducibility. I'll select the check box next to K-Fold cross-validation and leave K set to five. I'll type 3 into the input box next to Repeated K-Fold. In the Method list, I'll deselect Neural. Now I'll click OK.

JMP responds by training and validating models for each of the selected methods using their default parameter settings and cross-validation. After completing the training and validating process, JMP displays the results in a new window. For each modeling method, the Model Screening platform provides performance measures in the form of point estimates for the coefficient of determination (also known as R squared), the root average squared error, and the standard deviation of the root average squared error.
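For readers who want to see what this repeated cross-validation amounts to computationally, here is a rough Python sketch using scikit-learn. It is my own illustration, not anything JMP runs internally; the scikit-learn estimators are only loose stand-ins for the platform's methods, and scikit-learn's bundled diabetes data stands in for the Diabetes sample table.

```python
# Rough analogue (not JMP's internal code) of the Model Screening run above:
# 3 x 5-fold repeated cross-validation, collecting one R^2 per validation fold
# for several regression methods.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)  # 15 folds in total

methods = {
    "Linear regression (stepwise-like)": LinearRegression(),
    "Lasso": LassoCV(),
    "Decision tree": DecisionTreeRegressor(random_state=1234),
    "Random forest (bootstrap-forest-like)": RandomForestRegressor(random_state=1234),
    "Gradient boosting (boosted-tree-like)": GradientBoostingRegressor(random_state=1234),
}

for name, model in methods.items():
    r2 = cross_val_score(model, X, y, scoring="r2", cv=cv)  # one R^2 per fold per repeat
    print(f"{name:40s} mean R^2 = {r2.mean():.3f} (sd = {r2.std(ddof=1):.3f})")
```

Each method ends up with 15 fold-level R squared values (5 folds times 3 repeats), and fold-level values like these are the raw material the Confidence Curves described below are built from.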

Now I'll click Select Dominant. JMP responds by highlighting the method that performs best across the performance measures. What's missing here is a graphic to show the size of the differences between the dominant method and the other methods, along with a visualization of the uncertainty associated with the differences.

But why not just show P-values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since the P-value provides a probability based on a standardized difference, a P-value by itself loses information about the raw difference. A significant difference doesn't imply a meaningful difference.

Is that really a problem? I mean, isn't it pointless to be concerned with the size of the difference between two models before using significance testing to determine whether the difference is real? The problem with that line of thinking is that it is power, or one minus beta, that determines our ability to correctly reject a null hypothesis.

Authors such as Jacob Cohen and Frank Schmidt have suggested that typical studies have the power to detect differences in the range of 0.4 to 0.6.

So let's suppose we have a difference where the power to detect a true difference is 0.5 at an alpha level of 0.05. That suggests we would detect the true difference, on average, 50% of the time. So in that case, significance testing would identify real differences no better than flipping an unbiased coin.
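As a quick, hypothetical sanity check on that 50% figure (my own illustration, not a calculation from the paper): the power of a two-sided two-sample t-test for a medium standardized difference of d = 0.5 with 32 observations per group at alpha = 0.05 comes out to roughly one half.

```python
# Hypothetical illustration: power of a two-sided two-sample t-test
# for a medium effect (Cohen's d = 0.5) with 32 observations per group.
import numpy as np
from scipy import stats

d, n, alpha = 0.5, 32, 0.05        # effect size, per-group sample size, significance level
df = 2 * n - 2                     # degrees of freedom for the pooled t-test
ncp = d * np.sqrt(n / 2)           # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Power = probability that |t| exceeds the critical value under the noncentral t
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(f"power = {power:.2f}")      # prints about 0.50 -- a coin flip
```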

If all other things are equal, type 1 and type 2 errors are equivalent. But significance tests that use an alpha value of 0.05 often implicitly assume type 2 errors are preferable to type 1 errors, particularly if the power is as low as 0.5.

A common suggestion to address these and other issues with significance testing is to show the point estimate along with confidence intervals. One objection to doing so is that a point estimate along with a 95% confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone P-value conceals them both.

So various authors, including Cohen and Schmidt, have recommended replacing significance testing with point estimates and confidence intervals.

Even so, the recommendation to use confidence intervals raises the question: which ones do we show? Showing only the 95% confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by Confidence Curves is to literally show all confidence intervals up to an arbitrarily high confidence level.

How do I show Confidence Curves in JMP? To conveniently create Confidence Curves in JMP, install the Confidence Curves add-in by visiting the JMP Community homepage. Type Confidence Curves into the search input field. Click the Confidence Curves result. Now click the download icon next to the Confidence Curves add-in file (.jmpaddin). Now click the downloaded file. JMP responds by asking if I want to install the add-in. You would click Install. However, I'll click Cancel, as I've already installed the add-in.

So how do you use the add-in? First, to generate Confidence Curves for this report, select Save Results Table from the top red triangle menu located on the Model Screening report window.

JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the name of the method used to create each model; and N, which contains the number of data points used in the validation folds.

Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric to compare between methods. I'll be using R squared as the metric of interest in this presentation.
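If you want to work with the saved results outside the add-in, a small pandas sketch like the following pairs each method's fold-level R squared with the baseline's. The CSV file name and column spellings here are assumptions for illustration only, not requirements of the add-in.

```python
# Hypothetical sketch: pairing each method's per-fold R^2 with a baseline method's,
# assuming the saved results table has been exported to CSV with columns
# Trial, Fold, Method, N, and RSquare (names assumed for illustration).
import pandas as pd

results = pd.read_csv("model_screening_results.csv")
wide = results.pivot_table(index=["Trial", "Fold"], columns="Method", values="RSquare")

baseline = "Fit Stepwise"
diffs = wide.drop(columns=baseline).sub(wide[baseline], axis=0)  # method minus baseline, per fold
print(diffs.mean())  # mean difference in R^2 versus the baseline, one value per method
```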

Once you have the model screening results table, click Add-Ins from JMP's main menu bar and then select Confidence Curves. The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of this add-in. As it is, the first dialog that appears requests you to select the name of the table that was generated when you chose Save Results Table from the Model Screening report's red triangle menu. The name of the table in this case is Model Screening Statistics Validation Set.

Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked the Select Dominant option in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests you to select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison.

JMP responds by creating a Confidence Curve table that contains P-values and corresponding confidence levels for the mean difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P-value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row contains the upper or lower endpoint of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, Confidence Curves are created and shown in a Graph Builder graph.

So what are Confidence Curves? To clarify the key attributes of a Confidence Curve, I'll hide all but the Support Vector Machines Confidence Curve using the local data filter by clicking on Support Vector Machines. By default, a Confidence Curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll shift-click the Points icon. JMP responds by displaying the endpoints of the confidence intervals that make up the Confidence Curve.

Now I will zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the P-value, the confidence interval, the difference in the metric, and the method used to generate the model being compared to the reference model. Now we'll turn off the points by shift-clicking the Points icon and clicking the Done button. Even though the individual points are no longer shown, you can still view the associated hover label by placing the mouse pointer over the Confidence Curve.

The point estimate for the mean difference in performance between the Support Vector Machines and Fit Stepwise models is shown at the 0% confidence level; it is the mean value of the differences computed using cross-validation. A Confidence Curve plots the extent of each confidence interval from the generated table between zero and the 99.99% confidence level. The confidence level associated with each confidence interval is shown along the left Y axis, and the P-values associated with the confidence intervals are shown along the right Y axis. The Y axis uses a log scale so that more resolution is shown at higher confidence levels.

By default, two reference lines are plotted alongside a Confidence Curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95% confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings.

If a Confidence Curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing. For example, we cannot reject the null hypothesis for Support Vector Machines. On the other hand, if a Confidence Curve crosses the vertical line below the horizontal line, you can reject the null hypothesis using significance testing. For example, we can reject the null hypothesis for Boosted Tree.

How are Confidence Curves computed? The current implementation of Confidence Curves assumes the differences are computed using r-times repeated k-fold cross-validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled t-test. Authors Claude Nadeau and Yoshua Bengio note that a corrected resampled t-test is typically used in cases where training sets are five or ten times larger than validation sets.
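As a sketch of the interval construction (the paper and Nadeau and Bengio's article have the precise details; the degrees of freedom shown here are my assumption), with m = r·k fold-level differences d_ij, a training set of size n_1, and a validation set of size n_2 per fold, the interval at confidence level 1 − α has the form

$$
\bar{d} \;\pm\; t_{1-\alpha/2,\;m-1}\,\sqrt{\left(\frac{1}{m}+\frac{n_2}{n_1}\right)\hat{\sigma}^2},
\qquad
\bar{d}=\frac{1}{m}\sum_{i=1}^{r}\sum_{j=1}^{k} d_{ij},
\qquad
\hat{\sigma}^2=\frac{1}{m-1}\sum_{i,j}\bigl(d_{ij}-\bar{d}\bigr)^2,
$$

where the n_2/n_1 term is the correction that accounts for the overlap among training sets across folds; the uncorrected resampled t-test, which omits that term, is known to be badly anticonservative.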

For more details, please see the paper associated with this presentation.
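To make the construction concrete, here is a minimal Python sketch of the same computation. It is my own illustration rather than the add-in's code, and the simulated differences, fold sizes, and confidence grid are all assumptions.

```python
# Minimal sketch of a confidence curve built from r x k cross-validated differences
# in R^2 between two methods, using a variance-corrected resampled t interval.
# Not the add-in's actual code; the data and fold sizes below are made up.
import numpy as np
from scipy import stats

def confidence_curve(diffs, n_train, n_test, levels=None):
    """Return (levels, lower, upper) interval endpoints for each confidence level."""
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.size                                      # m = r * k fold-level differences
    d_bar = diffs.mean()
    var = diffs.var(ddof=1)
    se = np.sqrt((1.0 / m + n_test / n_train) * var)    # corrected standard error
    if levels is None:
        levels = np.linspace(0.0, 0.9999, 500)          # 0% up to 99.99% confidence
    t_quant = stats.t.ppf(0.5 + levels / 2.0, df=m - 1)
    return levels, d_bar - t_quant * se, d_bar + t_quant * se

# Example: 3 repeats x 5 folds = 15 simulated differences in R^2;
# fold sizes roughly match 5-fold splits of a table with a few hundred rows.
rng = np.random.default_rng(1234)
diffs = rng.normal(loc=-0.02, scale=0.03, size=15)
levels, lo, hi = confidence_curve(diffs, n_train=354, n_test=88)

for c in (0.0, 0.50, 0.95, 0.9999):
    i = np.argmin(np.abs(levels - c))
    print(f"{100 * levels[i]:7.2f}%  [{lo[i]:+.4f}, {hi[i]:+.4f}]   p = {1 - levels[i]:.4f}")
```

At the 0% level the two endpoints collapse to the mean difference itself, which is why the point estimate sits at the bottom of the curve.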

So how are Confidence Curves interpreted? First, the Confidence Curve graphically depicts the mean difference in the metric of interest between a given method and a reference method at the 0% confidence level, so we can evaluate whether the mean difference between the methods is meaningful. If the mean difference isn't meaningful, there's little point in further analysis of a given method versus the reference method with respect to the chosen metric. What constitutes a meaningful difference depends on the metric of interest as well as the intended scientific or engineering application.

For example, you can see the model developed with the Decision Tree method is on average about 14% worse than Fit Stepwise, which arguably is a meaningful difference. If the difference is meaningful, we can evaluate how precisely the difference has been measured by evaluating how much the Confidence Curve width changes across the confidence levels.

For any confidence interval not crossing the default vertical reference line, we have at least that level of confidence that the mean difference is nonzero. For example, the Decision Tree Confidence Curve doesn't cross the vertical reference line until about the 99.98% confidence level, so we are nearly 99.98% confident the mean difference isn't equal to zero. In fact, with this data, it turns out that we can be about 81% confident that Fit Stepwise is at least as good, if not better, than every method other than Generalized Regression Lasso.

Now let's consider the relationship between Confidence Curves. If two or more Confidence Curves substantially overlap and the mean difference of each is not meaningfully different from the others, the data suggest each method performs about the same as the others with respect to the reference model.

So, for example, we see that on average the Support Vector Machines model performs within about 0.5% of Bootstrap Forest, which is arguably not a meaningful difference. The confidence intervals begin to overlap at about the 4% confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference.

If the average difference in performance is about the same for two Confidence Curves, but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model; in this case, however, we are confident that the difference is not a meaningful one. This particular case is rarer than the others, and I don't have an example to show with this data set.

On the other hand, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the Confidence Curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the Generalized Regression Lasso model predicts about 13.8% more of the variation in the response than does the Decision Tree model. Moreover, the Confidence Curves don't overlap until about the 99.9% confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference.

Finally, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the difference is spurious. For example, we can see that on average Support Vector Machines predicted about 1.4% more of the variance in the response than did K Nearest Neighbors.

However, the confidence intervals begin to overlap at about the 17% confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain.

Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured.

But what about multiple comparisons? Don't we need to adjust the P-values to control the family-wise type 1 error rate?

In his paper about Confidence Curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies, where a goal is prespecified, but not in exploratory studies. This idea suggests using unadjusted P-values for multiple Confidence Curves in an exploratory fashion, and using only a single Confidence Curve, generated from different data, to confirm a finding of a significant difference between two methods when using significance testing. That said, please keep in mind the dangers of cherry-picking and p-hacking when conducting exploratory studies.

In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and Confidence Curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared.

However, P-values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, P-values are often misinterpreted as the probability that the null hypothesis is true. Instead, a P-value is the probability of observing a difference as extreme or more extreme than the one observed, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I have argued that it is not uncommon to have only a 50% chance of correctly rejecting the null hypothesis with an alpha value of 0.05.

As an alternative, a confidence interval could be shown instead of a lone P-value. However, the question would be left open as to which confidence level to show. Confidence Curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the 0% confidence level, and it acts as a point estimate. All other things being equal, type 1 and type 2 errors are equivalent, and Confidence Curves don't embed a bias towards trading type 1 errors for type 2 errors.

Even so, by default, a vertical line is shown in the Confidence Curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95% confidence interval, which readily affords a typical significance testing analysis if desired. The defaults for these lines are easily modified if a different null hypothesis or confidence level is desired.

Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a Confidence Curve as a point estimate along with a nearly comprehensive view of its associated uncertainty. If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site. And don't forget to vote for this presentation if you found it interesting or useful. Thank you for watching this presentation, and I hope you have a great day.