Comparing Predictive Model Performance with Confidence Curves (2022-US-45MP-1141)

Bryan Fricke, JMP Principal Software Developer, SAS
Russ Wolfinger, Director, Research and Development, JMP

 

Repeated k-fold cross-validation is commonly used to evaluate the performance of predictive models. The problem is, how do you know when a difference in performance is sufficiently large to declare one model better than another? Typically, null hypothesis significance testing (NHST) is used to determine if the differences between predictive models are “significant”, although the usefulness of NHST has been debated extensively in the statistics literature in recent years. In this paper, we discuss problems associated with NHST and present an alternative known as confidence curves, which has been developed as a new JMP Add-In that operates directly on the results generated from JMP Pro's Model Screening platform.

 

 

Hello. My name is Bryan Fricke. I'm a product manager at JMP focused on the JMP user experience. Previously, I was a software developer working on exporting reports to standalone HTML files, JMP Live, and JMP Public.

In this presentation, I'm going to talk about using Confidence Curves as an alternative to null hypothesis significance testing in the context of predictive model screening. Additional material on this subject can be found on the JMP Community website in the paper associated with this presentation. Dr. Russ Wolfinger is a Distinguished Research Fellow at JMP and a co-author, and I would like to thank him for his contributions.

The Model Screening platform, introduced in JMP Pro 16, allows you to evaluate the performance of multiple predictive models using cross-validation. To show you how the Model Screening platform works, I'm going to use the Diabetes data table, which is available in the JMP sample data library.

I'll choose Model Screening from the Analyze > Predictive Modeling menu. JMP responds by displaying the Model Screening dialog. The first three columns in the data table represent disease progression in continuous, binary, and ordinal forms. I'll use the continuous column named Y as the response variable. I'll use the columns from Age to Glucose in the X, Factor role. I'll type 1234 in the Set Random Seed input box for reproducibility. I'll select the check box next to K-Fold cross-validation and leave K set to five. I'll type 3 into the input box next to Repeated K-Fold. In the Method list, I'll deselect Neural. Now I'll click OK.

JMP responds by training and validating models for each of the selected methods using their default parameter settings and cross-validation. After completing the training and validating process, JMP displays the results in a new window. For each modeling method, the Model Screening platform provides performance measures in the form of point estimates for the coefficient of determination (also known as R squared), the root average squared error, and the standard deviation of the root average squared error.
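For readers who want to see what this repeated cross-validation amounts to computationally, here is a rough Python sketch using scikit-learn. It is my own illustration, not anything JMP runs internally; the scikit-learn estimators are only loose stand-ins for the platform's methods, and scikit-learn's bundled diabetes data stands in for the Diabetes sample table.

```python
# Rough analogue (not JMP's internal code) of the Model Screening run above:
# 3 x 5-fold repeated cross-validation, collecting one R^2 per validation fold
# for several regression methods.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1234)  # 15 folds in total

methods = {
    "Linear regression (stepwise-like)": LinearRegression(),
    "Lasso": LassoCV(),
    "Decision tree": DecisionTreeRegressor(random_state=1234),
    "Random forest (bootstrap-forest-like)": RandomForestRegressor(random_state=1234),
    "Gradient boosting (boosted-tree-like)": GradientBoostingRegressor(random_state=1234),
}

for name, model in methods.items():
    r2 = cross_val_score(model, X, y, scoring="r2", cv=cv)  # one R^2 per fold per repeat
    print(f"{name:40s} mean R^2 = {r2.mean():.3f} (sd = {r2.std(ddof=1):.3f})")
```

Each method ends up with 15 fold-level R squared values (5 folds times 3 repeats), and fold-level values like these are the raw material the Confidence Curves described below are built from.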

Now I'll click Select Dominant. JMP responds by highlighting the method that performs best across the performance measures. What's missing here is a graphic to show the size of the differences between the dominant method and the other methods, along with a visualization of the uncertainty associated with the differences.

But why not just show P-values indicating whether the differences are significant? Shouldn't a decision about whether one model is superior to another be based on significance? First, since the P-value provides a probability based on a standardized difference, a P-value by itself loses information about the raw difference. A significant difference doesn't imply a meaningful difference.

Is that really a problem? I mean, isn't it pointless to be concerned with the size of the difference between two models before using significance testing to determine whether the difference is real? The problem with that line of thinking is that it is power, or one minus beta, that determines our ability to correctly reject a null hypothesis.

Authors such as Jacob Cohen and Frank Schmidt have suggested that typical studies have the power to detect differences in the range of 0.4 to 0.6.

So let's suppose we have a difference where the power to detect a true difference is 0.5 at an alpha level of 0.05. That suggests we would detect the true difference, on average, 50% of the time. So in that case, significance testing would identify real differences no better than flipping an unbiased coin.
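As a quick, hypothetical sanity check on that 50% figure (my own illustration, not a calculation from the paper): the power of a two-sided two-sample t-test for a medium standardized difference of d = 0.5 with 32 observations per group at alpha = 0.05 comes out to roughly one half.

```python
# Hypothetical illustration: power of a two-sided two-sample t-test
# for a medium effect (Cohen's d = 0.5) with 32 observations per group.
import numpy as np
from scipy import stats

d, n, alpha = 0.5, 32, 0.05        # effect size, per-group sample size, significance level
df = 2 * n - 2                     # degrees of freedom for the pooled t-test
ncp = d * np.sqrt(n / 2)           # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Power = probability that |t| exceeds the critical value under the noncentral t
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(f"power = {power:.2f}")      # prints about 0.50 -- a coin flip
```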

If all other things are equal, type 1 and type 2 errors are equivalent. But significance tests that use an alpha value of 0.05 often implicitly assume type 2 errors are preferable to type 1 errors, particularly if the power is as low as 0.5.

A common suggestion to address these and other issues with significance testing is to show the point estimate along with confidence intervals. One objection to doing so is that a point estimate along with a 95% confidence interval is effectively the same thing as significance testing. Even if we assume that is true, a point estimate and confidence interval still put the magnitude of the difference and the range of the uncertainty front and center, whereas a lone P-value conceals them both.

So various authors, including Cohen and Schmidt, have recommended replacing significance testing with point estimates and confidence intervals.

Even so, the recommendation to use confidence intervals raises the question: which ones do we show? Showing only the 95% confidence interval would likely encourage you to interpret it as another form of significance testing. The solution provided by Confidence Curves is to literally show all confidence intervals up to an arbitrarily high confidence level.

How do I show Confidence Curves in JMP? To conveniently create Confidence Curves in JMP, install the Confidence Curves add-in by visiting the JMP Community homepage. Type Confidence Curves into the search input field. Click the Confidence Curves result. Now click the download icon next to the Confidence Curves add-in file (.jmpaddin). Now click the downloaded file. JMP responds by asking if I want to install the add-in. You would click Install. However, I'll click Cancel, as I've already installed the add-in.

So how do you use the add-in? First, to generate Confidence Curves for this report, select Save Results Table from the top red triangle menu located on the Model Screening report window.

JMP responds by creating a new table containing, among others, the following columns: Trial, which contains the identifiers for the three sets of cross-validation results; Fold, which contains the identifiers for the five distinct sets of subsamples used for validation in each trial; Method, which contains the name of the method used to create each model; and N, which contains the number of data points used in the validation folds.

Note that the Trial column will be missing if the number of repeats is exactly one, in which case the Trial column is neither created nor needed. Save for that exception, these columns are essential for the Confidence Curves add-in to function properly. In addition to these columns, you need one column that provides the metric to compare between methods. I'll be using R squared as the metric of interest in this presentation.
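If you want to work with the saved results outside the add-in, a small pandas sketch like the following pairs each method's fold-level R squared with the baseline's. The CSV file name and column spellings here are assumptions for illustration only, not requirements of the add-in.

```python
# Hypothetical sketch: pairing each method's per-fold R^2 with a baseline method's,
# assuming the saved results table has been exported to CSV with columns
# Trial, Fold, Method, N, and RSquare (names assumed for illustration).
import pandas as pd

results = pd.read_csv("model_screening_results.csv")
wide = results.pivot_table(index=["Trial", "Fold"], columns="Method", values="RSquare")

baseline = "Fit Stepwise"
diffs = wide.drop(columns=baseline).sub(wide[baseline], axis=0)  # method minus baseline, per fold
print(diffs.mean())  # mean difference in R^2 versus the baseline, one value per method
```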

Once you have the model screening results table, click Add-Ins from JMP's main menu bar and then select Confidence Curves. The logic that follows would be better placed in a wizard, and I hope to add that functionality in a future release of this add-in. As it is, the first dialog that appears requests you to select the name of the table that was generated when you chose Save Results Table from the Model Screening report's red triangle menu. The name of the table in this case is Model Screening Statistics Validation Set.

Next, a dialog is displayed that requests the name of the method that will serve as the baseline from which all the other performance metrics are measured. I suggest starting with the method that was selected when you clicked the Select Dominant option in the Model Screening report window, which in this case is Fit Stepwise. Finally, a dialog is displayed that requests you to select the metric to be compared between the various methods. As mentioned earlier, I'll use R squared as the metric for comparison.

JMP responds by creating a Confidence Curve table that contains P-values and corresponding confidence levels for the mean difference between the chosen baseline method and each of the other methods. More specifically, the generated table has columns for the following: Model, in which each row contains the name of the modeling method whose performance is evaluated relative to the baseline method; P-value, in which each row contains the probability associated with a performance difference at least as extreme as the value shown in the Difference in R Square column; Confidence Interval, in which each row contains the confidence level we have that the true mean is contained in the associated interval; and finally, Difference in R Square, in which each row contains the upper or lower endpoint of the expected difference in R squared associated with the confidence level shown in the Confidence Interval column. From this table, Confidence Curves are created and shown in a Graph Builder graph.

So what are Confidence Curves? To clarify the key attributes of a Confidence Curve, I'll hide all but the Support Vector Machines Confidence Curve using the local data filter by clicking on Support Vector Machines. By default, a Confidence Curve only shows the lines that connect the extremes of each confidence interval. To see the points, select Show Control Panel from the red triangle menu located next to the text that reads Graph Builder in the title bar. Now I'll shift-click the Points icon. JMP responds by displaying the endpoints of the confidence intervals that make up the Confidence Curve.

Now I will zoom in and examine a point. If you hover the mouse pointer over any of these points, a hover label shows the P-value, the confidence interval, the difference in the metric, and the method used to generate the model being compared to the reference model. Now we'll turn off the points by shift-clicking the Points icon and clicking the Done button. Even though the individual points are no longer shown, you can still view the associated hover label by placing the mouse pointer over the Confidence Curve.

The point estimate for the mean difference in performance between the Support Vector Machines and Fit Stepwise models is shown at the 0% confidence level; it is the mean value of the differences computed using cross-validation. A Confidence Curve plots the extent of each confidence interval from the generated table between zero and the 99.99% confidence level. The confidence level associated with each confidence interval is shown along the left Y axis, and the P-values associated with the confidence intervals are shown along the right Y axis. The Y axis uses a log scale so that more resolution is shown at higher confidence levels.

By default, two reference lines are plotted alongside a Confidence Curve. The vertical line represents the traditional null hypothesis of no difference in effect. Note that you can change the vertical line position, and thereby the implicit null hypothesis, in the X axis settings. The horizontal line passes through the conventional 95% confidence interval. As with the vertical reference line, you can change the horizontal line position, and thereby the implicit level of significance, by changing the Y axis settings.

If a Confidence Curve crosses the vertical line above the horizontal line, you cannot reject the null hypothesis using significance testing. For example, we cannot reject the null hypothesis for Support Vector Machines. On the other hand, if a Confidence Curve crosses the vertical line below the horizontal line, you can reject the null hypothesis using significance testing. For example, we can reject the null hypothesis for Boosted Tree.

How are Confidence Curves computed? The current implementation of Confidence Curves assumes the differences are computed using r-times repeated k-fold cross-validation. The extent of each confidence interval is computed using what is known as a variance-corrected resampled t-test. Authors Claude Nadeau and Yoshua Bengio note that a corrected resampled t-test is typically used in cases where training sets are five or ten times larger than validation sets.
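As a sketch of the interval construction (the paper and Nadeau and Bengio's article have the precise details; the degrees of freedom shown here are my assumption), with m = r·k fold-level differences d_ij, a training set of size n_1, and a validation set of size n_2 per fold, the interval at confidence level 1 − α has the form

$$
\bar{d} \;\pm\; t_{1-\alpha/2,\;m-1}\,\sqrt{\left(\frac{1}{m}+\frac{n_2}{n_1}\right)\hat{\sigma}^2},
\qquad
\bar{d}=\frac{1}{m}\sum_{i=1}^{r}\sum_{j=1}^{k} d_{ij},
\qquad
\hat{\sigma}^2=\frac{1}{m-1}\sum_{i,j}\bigl(d_{ij}-\bar{d}\bigr)^2,
$$

where the n_2/n_1 term is the correction that accounts for the overlap among training sets across folds; the uncorrected resampled t-test, which omits that term, is known to be badly anticonservative.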

For more details, please see the paper associated with this presentation.
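To make the construction concrete, here is a minimal Python sketch of the same computation. It is my own illustration rather than the add-in's code, and the simulated differences, fold sizes, and confidence grid are all assumptions.

```python
# Minimal sketch of a confidence curve built from r x k cross-validated differences
# in R^2 between two methods, using a variance-corrected resampled t interval.
# Not the add-in's actual code; the data and fold sizes below are made up.
import numpy as np
from scipy import stats

def confidence_curve(diffs, n_train, n_test, levels=None):
    """Return (levels, lower, upper) interval endpoints for each confidence level."""
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.size                                      # m = r * k fold-level differences
    d_bar = diffs.mean()
    var = diffs.var(ddof=1)
    se = np.sqrt((1.0 / m + n_test / n_train) * var)    # corrected standard error
    if levels is None:
        levels = np.linspace(0.0, 0.9999, 500)          # 0% up to 99.99% confidence
    t_quant = stats.t.ppf(0.5 + levels / 2.0, df=m - 1)
    return levels, d_bar - t_quant * se, d_bar + t_quant * se

# Example: 3 repeats x 5 folds = 15 simulated differences in R^2;
# fold sizes roughly match 5-fold splits of a table with a few hundred rows.
rng = np.random.default_rng(1234)
diffs = rng.normal(loc=-0.02, scale=0.03, size=15)
levels, lo, hi = confidence_curve(diffs, n_train=354, n_test=88)

for c in (0.0, 0.50, 0.95, 0.9999):
    i = np.argmin(np.abs(levels - c))
    print(f"{100 * levels[i]:7.2f}%  [{lo[i]:+.4f}, {hi[i]:+.4f}]   p = {1 - levels[i]:.4f}")
```

At the 0% level the two endpoints collapse to the mean difference itself, which is why the point estimate sits at the bottom of the curve.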

So how are Confidence Curves interpreted? First, the Confidence Curve graphically depicts the mean difference in the metric of interest between a given method and a reference method at the 0% confidence level, so we can evaluate whether the mean difference between the methods is meaningful. If the mean difference isn't meaningful, there's little point in further analysis of a given method versus the reference method with respect to the chosen metric. What constitutes a meaningful difference depends on the metric of interest as well as the intended scientific or engineering application.

For example, you can see the model developed with the Decision Tree method is on average about 14% worse than Fit Stepwise, which arguably is a meaningful difference. If the difference is meaningful, we can evaluate how precisely the difference has been measured by evaluating how much the Confidence Curve width changes across the confidence levels.

For any confidence interval not crossing the default vertical reference line, we have at least that level of confidence that the mean difference is nonzero. For example, the Decision Tree Confidence Curve doesn't cross the vertical reference line until about the 99.98% confidence level, so we are nearly 99.98% confident the mean difference isn't equal to zero. In fact, with this data, it turns out that we can be about 81% confident that Fit Stepwise is at least as good, if not better, than every method other than Generalized Regression Lasso.

Now let's consider the relationship between Confidence Curves. If two or more Confidence Curves substantially overlap and the mean difference of each is not meaningfully different from the others, the data suggest each method performs about the same as the others with respect to the reference model.

So, for example, we see that on average the Support Vector Machines model performs within about 0.5% of Bootstrap Forest, which is arguably not a meaningful difference. The confidence intervals begin to overlap at about the 4% confidence level, which suggests these values would be expected if both methods really do have about the same difference in performance with respect to the reference.

If the average difference in performance is about the same for two Confidence Curves, but the confidence intervals don't overlap much, the data suggest the models perform about the same as each other with respect to the reference model; in this case, however, we are confident that the difference is not a meaningful one. This particular case is rarer than the others, and I don't have an example to show with this data set.

On the other hand, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the Confidence Curves have little overlap, the data suggest the models perform differently from one another with respect to the reference. For example, the Generalized Regression Lasso model predicts about 13.8% more of the variation in the response than does the Decision Tree model. Moreover, the Confidence Curves don't overlap until about the 99.9% confidence level, which suggests these results would be quite unusual if the methods actually performed about the same with respect to the reference.

Finally, if the average difference in performance between a pair of Confidence Curves is meaningfully different and the curves have considerable overlap, the data suggest that while the methods perform differently from one another with respect to the reference, it wouldn't be surprising if the difference is spurious. For example, we can see that on average Support Vector Machines predicted about 1.4% more of the variance in the response than did K Nearest Neighbors.

However, the confidence intervals begin to overlap at about the 17% confidence level, which suggests it wouldn't be surprising if the difference in performance between each method and the reference is actually smaller than suggested by the point estimates. Simultaneously, it wouldn't be surprising if the actual difference is larger than measured, or if the direction of the difference is actually reversed. In other words, the difference in performance is uncertain.

Note that it isn't possible to assess the variability in performance between two models relative to one another when the differences are relative to a third model. To compare the variability in performance between two methods relative to one another, one of the two methods must be the reference method from which the differences are measured.

But what about multiple comparisons? Don't we need to adjust the P-values to control the family-wise type 1 error rate?

In his paper about Confidence Curves, Daniel Berrar suggests that adjustments are needed in confirmatory studies, where a goal is prespecified, but not in exploratory studies. This idea suggests using unadjusted P-values for multiple Confidence Curves in an exploratory fashion, and using only a single Confidence Curve, generated from different data, to confirm a finding of a significant difference between two methods when using significance testing. That said, please keep in mind the dangers of cherry-picking and p-hacking when conducting exploratory studies.

In summary, the Model Screening platform introduced in JMP Pro 16 provides a means to simultaneously compare the performance of predictive models created using different methodologies. JMP has a long-standing goal to provide a graph with every statistic, and Confidence Curves help to fill that gap for the Model Screening platform. You might naturally expect to use significance testing to differentiate between the performance of the various methods being compared.

However, P-values have come under increased scrutiny in recent years for obscuring the size of performance differences. In addition, P-values are often misinterpreted as the probability that the null hypothesis is true. Instead, a P-value is the probability of observing a difference as extreme or more extreme than the one observed, assuming the null hypothesis is true. The probability of correctly rejecting the null hypothesis when it is false is determined by power, or one minus beta. I have argued that it is not uncommon to have only a 50% chance of correctly rejecting the null hypothesis with an alpha value of 0.05.

As an alternative, a confidence interval could be shown instead of a lone P-value. However, the question would be left open as to which confidence level to show. Confidence Curves address these concerns by showing all confidence intervals up to an arbitrarily high level of confidence. The mean difference in performance is clearly visible at the 0% confidence level, and it acts as a point estimate. All other things being equal, type 1 and type 2 errors are equivalent, and Confidence Curves don't embed a bias towards trading type 1 errors for type 2 errors.

Even so, by default, a vertical line is shown in the Confidence Curve graph for the standard null hypothesis of no difference. In addition, a horizontal line is shown that delineates the 95% confidence interval, which readily affords a typical significance testing analysis if desired. The defaults for these lines are easily modified if a different null hypothesis or confidence level is desired.

Even so, given the rather broad and sometimes emphatic suggestion to replace significance testing with point estimates and confidence intervals, it may be best to view a Confidence Curve as a point estimate along with a nearly comprehensive view of its associated uncertainty. If you have feedback about the Confidence Curves add-in, please leave a comment on the JMP Community site. And don't forget to vote for this presentation if you found it interesting or useful. Thank you for watching this presentation, and I hope you have a great day.