Box Plot Analyses: Blending Scientific/Artistic Enquiry in Univariate Response Characterization (2022-US-45MP-1111)

Patrick Giuliano, Senior Analytical Technical Support Engineer, JMP (SAS)
Mason Chen, Black Belt Student, Stanford University OHS
Charles Chen, Master Black Belt / Continuous Improvement Expert, Applied Materials

 

Through JMP 16 outlier and quantile box plots (distribution), together with quantile range outlier and robust fit outlier detection (screening), we present comprehensive strategies to powerfully separate signal from noise in the presence of univariate response(s). We also propose that through practical analysis with the box plot, we can connect the Gauge R&R noise impact with the location of the points most adjacent to the upper and lower fences. We use Monte Carlo sampling (random() function and instant column formulas) to produce multiple distribution types (normal, uniform, peaked, bimodal) to validate the impact on the box plot and histogram together to detect normality violation failure modes.

 

We demonstrate that the box plot is a powerful visualization tool to judge the data distribution, uniquely able to separate skewness from outliers. Graph Builder, one-way, GoF, and nonparametric hypothesis testing show that, since the box plot is very weak at detecting bimodality or kurtosis and cannot support hypothesis test decisions (it carries no sample size information), both the histogram and box plot are needed to visualize normality. Together with descriptive statistics, the most powerful discrimination between different candidate distributions is presented. Finally, we synthesize and demonstrate our learning experience by formulating 17 thought-provoking quiz questions and answers to maximize the utility of the box plot for data-driven problem solving.

 

 

Well,  thank  you  everyone  for  joining  me.

This  is  a  Discovery Summit  2022  presentation,

courtesy  of  my  co-presenters, Charles  Chen  and  Mason  Chen.

My  name  is  Patrick  Giuliano.

The  title  of  this  talk  is,

Box Plot Analysis: Blending Scientific and Artistic Enquiry

in Univariate Response Characterization.

Here's  the  abstract.

You can find this on the JMP User Community

in  the  Discovery  2022  community  page,

US Discovery 2022 community page.

I'm  putting  it  here  for  reference.

I  will  provide  a  link  in  the  slides

to  the  community  page where  the  project  will  live.

What's  the  motivation  for  this  project?

The   Box Plot  is  one of  the  most  popular  graphical  tools

to  visualize  a  univariate   distribution  of  data.

This  project  studies  how  to  use the   Box Plot  to  analyze  data  effectively.

Most  people  who  use the   Box Plot  don't  use  it  necessarily

to  determine  the  shape of  the  distribution  of  the  data.

In  fact,  many  people  use  it  wrongly to  draw  mean  or  mean  comparison  decisions,

and  they  may  assume normality  based  on  symmetry,

when  in  fact, the  normality  assumption

would  actually  not  be  reasonable if  they  were  to  take  a  closer  look

at  the  shape  of  the  data on  a  histogram,  for  example.

The  objective  of  this  project is  to  demonstrate  how  to  use  JMP

specifically  16, to  interpret  information  in  a   Box Plot

and  to  improve  proficiency

in   a  global  community of  scientists  and  engineers

that are really under a DMAIC or APS,

or  Lean  type  Six  Sigma  methodology,

which  is  very  popular,  obviously,  today and  over  the  last  few  decades.

The  interesting  thing  about   this  project  is  we  framed  it

in  the  context  of  17  quiz  questions.

This  is  a  question and  answer  slide  deck.

And  I'm  not  going to  go  into  too  much  detail

about  each  and  every  question,

which I will show you here.

But  what  I'd  like  to  do is  show  you  a  little  bit  about

how  you  can  use  JMP  to  explore the  answer  to  these  questions,

because  I  think  that's  really the  most  interesting  and  fun  part.

The  first thing  I  wanted  to  do  is  just   quickly  go  over  what  a   Box Plot  is.

So  what  is  the  anatomy  of  a   Box Plot?

Just  as  a  refresher  for  some  of  you, or  introduction  for  some  of  you,

the  median  is  indicated  by  the  midline,

and  it's  referred  to  as the  second  quartile,  Q2

or  the  50th  percentile.

Then   Q1  is  referred to  as  the  25th  percentile.

Q3  is  the  75th  percentile, as  you  can  see  here.

The  interquartile  range,  or  IQR,

is  the  difference between   Q3  and   Q1.

The  other  important  elements are  we  have  what's  called

a  whisker  on  the  lower  side and  on  the  upper  side.

Right at  the  end  of  that  whisker, sometimes  we  refer  to  it  as  the fence

and  you'll  see  a  vertical  line

and  JMP  draws  a  vertical  line   to  indicate  that  edge.

What's  important  about  this  is that  this  location  is  actually

Q1 minus  1.5  times  the  IQR,

which  is  represented  by  the  distance between  this  edge and this edge.

This point  and  the  upper  fence  is

Q3 plus one  and  a  half  times  the  IQR.

So   that  defines  the  upper  edge.

Then  any  points  that are  beyond  these  edges  or  fences

are  considered  potential  outliers.

And  they  actually  show  up   by  themselves  as  points

whereas  the  rest  of  the  data in  the  middle  of  the  histogram

is not shown, for emphasis on the points that are beyond these fences.
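The anatomy described above can be sketched in a few lines of Python. This is an illustration rather than JMP's own code: the function name `box_plot_summary` and the sample values are hypothetical, and the `"exclusive"` quantile method is assumed here because it interpolates at position (n + 1)p, which appears to match the percentile definition mentioned in the comments at the end of this page.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, fences, whisker ends, and potential outliers."""
    data = sorted(values)
    # "exclusive" interpolates at position (n + 1) * p (an assumption about
    # which percentile definition to mirror, not a statement about JMP).
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Whiskers are drawn to the most extreme points inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical sample with one large value.
summary = box_plot_summary([2, 3, 4, 5, 6, 7, 8, 9, 10, 30])
print(summary)
```

In this hypothetical sample, only the value 30 lies beyond the upper fence, so it would be drawn as an individual point.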

I'm  going  to  jump  right  in.

How  did  we  explore  and  develop the  answers  to  these  questions,

and  in  some  cases, even  refine  the   questions  themselves?

Well,  we  created a  simulated  data  table in  JMP  16,

where  we  constructed  100  rows  of  data,

and  we  constructed  data, first from  a  normal  distribution,

and  then  applied  some transformation  to  that  data.

We see that we have normally distributed data

drawn from a population with a mean of zero and a standard deviation of one.

Then we  have  uniformly  distributed  data.

Then  we  have  data  that's  peaked,

i.e., has positive kurtosis.

Then we have data that's right skewed,

data that has two modes,

data that has some outliers, about 3% on average,

and then integers.

In  all  the  cases, with  the  exception  of  the  bimodal,

we  just  based  the  simulated  formula,

on  the  original  normal  column.

The  way  that  we  put  all the  data  on  the  same  scale

is  we  use  the  column  standardized  function

so  that  we  could  compare  all  the  data

relative  to  each  other in  the  distribution  platform.

This  is  just  a  preview  of  that.

I'll  jump  over  to  JMP and  show  you  that.

But  again,  all  of  this  data  is  centered at a mean of   approximately  zero

and  a  standard  deviation, approximately  one.

We  covered  the  first question.

Why is a Box Plot sometimes referred to as a five-point plot?

Well, there are five main points.

There's Q1, there's Q2 in the middle, there's Q3,

and then there are the whiskers, the upper and lower.

Next  question.

What are the two ways that the Box Plot can determine

whether the distribution is skewed?

Well,  we  can  look at  the  width  of  the  box  itself.

We  can  also  look at  the  width  of  the  whiskers.

In  this   right  skewed  example,

you  can  see  that  upper  whisker is  much  longer  than  the  lower  one.

So  that  would  imply that  the  data  is  right  skewed.

In  other  words,  that  the  tail  in  the  data,

if  you  were  to  imagine  that  distribution, is  pointing  to  the  right.

Third, why  does  the  Box Plot include the  median  and  not  the  mean?

Well,  a   Box Plot   uses  the  median to  determine  or  gauge  skewness.

So  if  the  distribution  is  normal, then  the  mean  is  equal  to  the  median.

And  in  fact,  what  you  would  see  here is  that  this  median  line  in  the  middle

would line up exactly with

the middle of this diamond.

In  that  case, you  would  effectively  have  a  situation

where  you're  not really  losing  any  information

because  the  distribution is symmetric.

The  median  in  general, then  might  be  considered  better,

regardless  of  whether the  distribution  is  normal  or  non-normal.

Fourth,  why  is  the  Box  Plot the  most  powerful  visualization  tool,

or  one  of  the  most  powerful  tools to  separate  skewness  and  outlier  problems?

When  we  talked  about  this  idea that  because  the   Box Plot  uses

this  Q1  minus  one  and  half  times  IQR

and  Q3 plus one  and  a  half times  IQR  methodology,

it  really  allows  us  to  separate potential  outliers  from  the  main  data.

It  also  gives  us  a  framework

by  which  to  judge whether  the  upper  whisker

is  larger  or  smaller  than  a  lower  whisker.

So those two components of the plot

really help us separate skewness

and potential outlyingness.

This is  a  unique  feature  of  the   Box Plot.

The  fifth question is  a  little  more  interesting.

What's  the  relationship between  the  interquartile  range,

that  distance  between   Q1  and   Q3,

and  the  standard  deviation, which  we  can  calculate  for  any  data  set,

regardless  of  how  it's  distributed?

If  the  data  is  normal,

what  about  if  the  data is  skewed  or  non-normal  or  peaked

or  any  other  shape?

Well,  we  know,  based  on  theory,

that the ratio of the IQR to the standard deviation

is about 1.35 for normal data.

What  would  that  ratio  look like  if  the  data  wasn't  normal?

Well,  we  can  explore  that  in  JMP,

and  I'm  going  to  show you  that  really  quickly.

Here's  the  data  set.

I'm  also  going  to  post this  on  the  community .

The  first thing  I'm  going  to  do is  go  ahead  and  show  you

how  I  get  to  a  visual  state

where  we  can  see  all  the   Box Plots  together,

without  the   distributions.

This  is  interesting,

but  I'm  going  to  go  ahead and  start  from  the  beginning.

I'm  going  to  analyze  distribution.

I'm going to  show you  how  I  got  there.

I'm  actually  going  to  click  everything,

and  JMP  is  going  to  give  me a  histogram  and  a   Box Plot  together.

At the end of the presentation,

we're  going  to  summarize why  that's  important.

But  what  I'm  going  to  do  here  is  I'm  going to  go  ahead  and  turn  off  the  histogram,

I  can  go  ahead  and  customize the  width  here  of  the  lines.

I  can  copy  this  customization  over,

which  is  really  nice.

I'm  going  to  hold  down the  control  key  because  I'm  on  a  PC.

I'm  going  to  right  click

and then I'm going to hit Edit, Copy, Paste Customizations,

and  that's  going  to  bring  them  all  over.

I'm  actually  going  to  hold  down the  control  key  again  and  resize  this

so that I  can  resize  them  all  together.

Now  I'm  going  to  minimize the  quantile  section,

because  I'm  going  to  get the  information  that  I  need

from  the  summary  statistics  section.

I  actually  have  the  IQR and  a  standard  deviation  shown  here.

Although I could customize this, either here or in the properties,

which  I  can  access  under  File,  Preferences and  under  the  distribution  platform  group.

What  I'm  going  to  do is   I'm  actually  going  to  make

this  information  into  a  data  table

and I'm going to right-click and select Make Combined Data Table to do that.

Now  I  only  need  the  IQR and the  standard  deviation.

I'm  really  only  interested  in  that.

So  I'm  actually  going  to  select, one  of  the  standard  deviations,

one  of  the  IQRs.

I'm  going  to  move  my  cursor  over  here

and  select  Matching  on  all  of  the  rows

that  have  these  values  in  them.

I'm  going  to  go  ahead and  invert  the  selection,

delete  the  rows  that  I  don't  want,

and  I'm  left  with  this.

Now  I'm  just  going  to  go  ahead   and  restructure  the  data

so  that  I  can  calculate  the  ratio  of  the  IQR  to  the  standard  deviation.

So  I'm  just  going  to  use Table  Split  for  that.

I'm going to go ahead and split by Column 1,

put Column 2 in here, and put these in a group.

I'm going  to  click  OK.

I  have  the  data  how  I  want.

This shows me which distribution

these statistics came from.

I'm going to go ahead and do a New Formula Column, Combine, Ratio.

There  you  go.

This  looks  a  little bit  hard  to  interpret  for  me.

I'm  going  to  go  ahead  and  change  it so  that  I  can  only  see  two  decimals.

I've  got  numbers  which  are  very  similar,

it  should  be,  anyway,  very  similar to  what  I  have  in  the  slide here

detailing  the  ratio  of  the  IQR to  the  standard  deviation.

Of course,  they're  going  to  be  different because  there's  sampling  error.

This  table  is  only  one sampling  experiment.

But  this  is  how  I  can  quickly

and  interactively  extract  this  information

and  really  understand, what  does  this  ratio  look  like

if  my  data  is  not  normal in  a  particular  way?

We can see here that the values that tend to be lower

are the peaked distribution and the one with outliers;

the values that tend to be higher than the typical

or the expected theoretical 1.35 for normal

are going to be the uniform, the right skewed, and the bimodal.
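The theoretical IQR-to-standard-deviation ratios for two of these shapes can be checked directly. The sketch below is a small illustration in Python; the function names are hypothetical, and the uniform result follows from IQR = (b - a)/2 and SD = (b - a)/sqrt(12).

```python
import math
import statistics
from statistics import NormalDist


def normal_iqr_sd_ratio():
    """IQR / SD for a normal distribution: 2 * z(0.75)."""
    z75 = NormalDist().inv_cdf(0.75)  # about 0.6745
    return 2 * z75


def uniform_iqr_sd_ratio():
    """IQR / SD for a uniform distribution on any interval [a, b]."""
    # IQR = (b - a) / 2 and SD = (b - a) / sqrt(12), so (b - a) cancels.
    return math.sqrt(12) / 2


def sample_iqr_sd_ratio(values):
    """Same ratio estimated from a sample (subject to sampling error)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return (q3 - q1) / statistics.stdev(values)


print(round(normal_iqr_sd_ratio(), 3))
print(round(uniform_iqr_sd_ratio(), 3))
```

The uniform ratio comes out above the normal one, which is consistent with the direction reported in the talk for the uniform column.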

Next  question, what's  the  ideal  outlier  percent

if  the  distribution  is  perfectly  normal?

Well, it turns out that if we look in the textbooks or do a simulation,

on  average,  we  should  see  about  0.7% of  the  points  beyond  the  fences,

in  a  normal  distribution,

or  at  least  perhaps  not  beyond  the  fences, but  if  we  were  to  do  a  control  chart,

we  would  certainly  see  about,

which  is  under the  assumption  of  normality.

For  example,

if  we  were  to  do  an  individual  moving range  chart,  we  would  see  around  0.7%

of  the  points  on  average being  outside  the  limits.

Although  for  practical  purposes,

we  could  probably  say,

if  we  saw  about  3 %  or  less of  the  distribution  beyond  the  limits,

we  would  consider  it  approximately  normal.
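The 0.7% figure for the box plot fences can be reproduced from the standard normal distribution. The sketch below is an illustration with a hypothetical function name; for comparison it also evaluates plain 3-sigma limits the same way, which give a smaller tail fraction than the 1.5 x IQR fences.

```python
from statistics import NormalDist


def normal_fraction_beyond_fences(multiplier=1.5):
    """Expected fraction of a normal population outside Q1/Q3 -/+ multiplier * IQR."""
    std = NormalDist()  # mean 0, standard deviation 1
    q3 = std.inv_cdf(0.75)
    iqr = 2 * q3  # symmetric distribution, so Q1 = -Q3
    upper_fence = q3 + multiplier * iqr
    return 2 * (1 - std.cdf(upper_fence))


def normal_fraction_beyond_k_sigma(k=3.0):
    """Expected fraction of a normal population outside +/- k standard deviations."""
    return 2 * (1 - NormalDist().cdf(k))


print(f"beyond 1.5 x IQR fences: {normal_fraction_beyond_fences():.2%}")
print(f"beyond 3-sigma limits:   {normal_fraction_beyond_k_sigma():.2%}")
```

With 100 rows, a fraction of about 0.7% corresponds to less than one point on average, so a handful of flagged points in a sample of that size is already worth a closer look.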

Why  is  that  question  important?

Well,  we  can  use  the  proportion of  the  points  beyond  the  fences

in  a  Box Plot   when the  sample  size  is  small

to  determine  whether  or  not we  have  some  evidence  of  normality

on  the  basis  of   outliers.

Although  if  our  sample  size  is  too  big,

then  we're  going  to  see  lots and  lots  of  points  beyond  those  fences.

So it's really important that we consider a "reasonable sample size."

And  that's  part  of  the  reason  why  we only  considered  100  rows  in  our  project.

Next  question, what's  the  difference  between

a  quartile  range and  a  quantile  range   Box Plot?

Well,  in  a  practical  context, anyway,  we  can  talk  about

the  Explore Outlier  utility  in  JMP  16,

which allows us to adjust the Q,

which  is  the  multiplier  on  the  IQR

and  the  tail  quantile,

which  is  essentially how  the  data  is  divided  up.

We  can  customize  that  range.

I'm  just  going  to  show  you what  that  looks  like  real  quickly.

I'm  going  to  go  into  Analyze,

I'm  going  to  go  to  Screening,

Explore  Outliers.

I'm  going  to  do  this  on  my  raw  data.

I'm  actually  going  to  close  this.

I'm  going  to  go  back to  the  raw  data  table.

I'll  just  pick  a  couple  of  these.

I'll actually pick the ones that I have in my slides: the peaked and the outliers columns.

I'm  going  to  go  ahead and  use  the  quantile  range  outliers.

I'm  going  to  adjust  this to  what  the   Box Plot uses: 0.25  and  1.5.

I'm  going  to  click  Rescan

and  JMP's going to  identify potential  outliers  here.

How  does  this  connect to  the  distribution  platform?

Well,  if  we  go  over  here,

we  look  at  this,

what  we're  going  to  see,

is  there  are a  number  of  outliers  here.

I'm  actually  going  to  select  the  rows,

I'm  going  to  go  over  here.

Well,  lo  and  behold, it's  these  values.

So  you  got   1, 2, 3, 4, 5, 6,7.

There's  seven  outliers.

1, 2, 3, 4, 5, 6,7.

That  squares  up.

That's  exactly  what  we  would  expect.

Similarly,  we've  got   1, 2, 3, 4,

and if we scroll over here, under the outliers,

we see that there are four.

Great.

Going  back  to  the  slides  here, we  can  customize  this .

And  that's  actually  what  we  get into  in   subsequent  Question  10.

How  do  we  determine  whether outliers  are  marginal  or  extreme?

Well,  and  why  is  it  important?

Well,  we  can  adjust  the  sensitivity

of  the  outlier  detection  based on   the  multiplier  on  the  IQR

while  keeping  the  tail quantile  the  same.

You  might  intuitively  expect

that  if  you  were  to  take  Q₃ plus a  larger  number  times  the  IQR,

it's  going  to  extend  the  whisker  length and similarly,  on  the  lower  side.

That's  going  to  mean that  more  points  are  going  to  fall  inside.

So fewer outliers would be detected.

We  should  be  able to  see  that  and  test  that  in JMP.

So  if  I  were  to  increase this  to  two  and  click  Rescan,

we  see  a  few  outliers become  part  of  the   Box Plot,

or  we  can  imagine a  situation  where  that's  the  case.

I'll  increase  this  to  three,

I'll hit Rescan,

and we see even fewer outliers being identified still.

As I go up to Q equal to five,

now I only have one outlier detected in the peaked column of data.

So  the  idea  here  is  that  we  can develop  criteria  for  Q,  for  example,

we might set it to three for a situation where a data point might be considered

a typographical error, where it might be an extreme or more extreme outlier.

We might set Q equal to 1.5 if, for example,

we think that the potential outlier  might  be  associated  with  variability

due to  the  measurement  system or special  process  variation.

We  can  do  some  simulation based on  our  application

and decide on what the  value  of  Q should be in  these  particular  scenarios.
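The effect of the multiplier Q can be sketched outside JMP as well. The code below is a rough Python illustration of a quantile range rule, not the Explore Outliers implementation: the helper names, the (n + 1)p interpolation, and the data values are all assumptions made for the example.

```python
def percentile(sorted_data, p):
    """Interpolate at position (n + 1) * p, clamped to the data range."""
    n = len(sorted_data)
    position = (n + 1) * p
    j = int(position)  # whole part of the position
    g = position - j   # fractional part used for interpolation
    if j < 1:
        return sorted_data[0]
    if j >= n:
        return sorted_data[-1]
    return (1 - g) * sorted_data[j - 1] + g * sorted_data[j]


def quantile_range_outliers(values, tail=0.25, q=1.5):
    """Flag values more than q * (interquantile range) beyond the tail quantiles."""
    data = sorted(values)
    low = percentile(data, tail)
    high = percentile(data, 1 - tail)
    spread = high - low  # equals the IQR when tail = 0.25
    return [x for x in data if x < low - q * spread or x > high + q * spread]


# Hypothetical measurements with a few extreme values on both sides.
sample = [-12.0, -4.2, -1.1, -0.8, -0.5, -0.3, -0.1, 0.0,
          0.2, 0.4, 0.6, 0.9, 1.3, 3.9, 6.5]

for q in (1.5, 3, 5):
    flagged = quantile_range_outliers(sample, tail=0.25, q=q)
    print(f"Q = {q}: {len(flagged)} flagged -> {flagged}")
```

Raising Q from 1.5 to 3 to 5 widens the acceptance band, so the count of flagged points can only stay the same or drop, which mirrors the Rescan behavior shown in the demo.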

In connection  with  that, in  Question  10,

we  touched  a  little  bit  on   GRR or  measurement  system  variability.

Question 8 goes a little bit deeper into this

and brings together some ideas.

The  idea  here  is  that   we  might actually consider

the  distance  between   the  upper  fence  and  the  first outlier

or  the  first potential outlier  series  of  outliers.

We  may  extend  that  upper  fence   by  a  distance  of  two times the Sigma

due  to  the  measurement  system  variability.

In  this  way, we're actually considering

the  variability due to  the  measurement  system.

And  we're  asking  ourselves, is  this  potential  value

within  the  noise  of  the measurement  system  or  not?

We're  creating  a  graphical  way, a blended graphical means

of  determining  whether   the  value  is  reasonable

under the expectation that there's measurement system variability.

I have here that the distance between the marginal outlier and the whisker

should be compared to the GRR noise standard deviation.

If it's within two standard deviations, we don't have 95% confidence

to conclude this marginal outlier is different from the whisker.

This is just a graphical version of a one-sample t-test, in effect.

We could actually construct a one-sample t-test

using this red line as our  target   and  the  observed  value,

or  rather  assumed  series  of  values, this black dot,

as  our  distribution relative  to  that  target.
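The two-sigma comparison described here amounts to a one-line check. The sketch below is only an illustration: the function name, the fence value, the point value, and the Gauge R&R standard deviation are all hypothetical numbers, not values from the slides.

```python
def within_measurement_noise(point, fence, sigma_grr, k=2.0):
    """True if the point lies within k measurement-system sigmas of the fence."""
    return abs(point - fence) <= k * sigma_grr


# Hypothetical values: an upper fence at 2.70 and a Gauge R&R sigma of 0.10.
upper_fence = 2.70
sigma_grr = 0.10

marginal = within_measurement_noise(2.85, upper_fence, sigma_grr)
extreme = within_measurement_noise(3.40, upper_fence, sigma_grr)
print(marginal, extreme)
```

A point that passes this check is close enough to the fence that measurement noise alone could explain it; a point that fails it is harder to attribute to the gauge.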

The  next  question,  how  many  points do we  need  really  to  produce  a   Box Plot

if  we're  sample  size  limited?

Well,  we  might  need   at  least  seven  points,

and our simulation  in  this  particular sampling  experiment  shows  that.

What's  happening  here?

Well,  each  of  these  three  data sets  have  the  same  median.

You  can  see  in  this  data  set, there  are  six  observations.

In  this  one  there  are  seven, and  then  this  one  there  are  eight.

Let's  start  on  the  left,  actually,

and  we  have  eight  observations, one  out  here  around  15.

What  if  we  reduce  the  number  of observations to seven

and  we  actually  included   the  same observation here,

but  we  reduce  one  of  the  others?

What  if  we  reduce  it  further while  maintaining  the  same  median?

Then what we see is that this outlier, 15,

which is still in the data set, is no longer an outlier.

In  essence,  it  becomes  absorbed  into the whisker  itself.

The  other  thing  that's  interesting about this simple experiment,

is  that  the  IQR  becomes  inflated when we go from seven to six.

We can see that visually, in that

the width of this box from Q1 to Q3 becomes much wider.

We  can  also  see  that  numerically  here.
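The same effect can be reproduced with a small made-up example. The three data sets below are hypothetical (they are not the values on the slide), chosen so that all three share a median of 4.5 and all contain the value 15; the quartile method is again assumed to be the (n + 1)p interpolation.

```python
import statistics


def upper_fence_report(values):
    """Summarize median, IQR, upper fence, and points flagged above the fence."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    fence = q3 + 1.5 * iqr
    return {
        "n": len(data),
        "median": statistics.median(data),
        "iqr": iqr,
        "upper_fence": fence,
        "flagged": [x for x in data if x > fence],
    }


# Hypothetical data sets with the same median and the same largest value.
eight = [1, 2, 3, 4, 5, 6, 7, 15]
seven = [1, 2, 3, 4.5, 6, 7, 15]
six = [1, 3, 4, 5, 7, 15]

for name, data in (("eight", eight), ("seven", seven), ("six", six)):
    print(name, upper_fence_report(data))
```

With eight and seven points, 15 sits above the upper fence. With six points the IQR grows from 5 to 6.5, the fence moves out to 18.75, and 15 is absorbed into the whisker.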

I  actually  want  to  show  you how  we  might  explore  that  in  JMP.

Here's  some  data.

It's  not  the  same  data, but  here's  some  data.

I  just  created  a  column that  ranks  the  data.

Again, I just use an instant column formula.

I  can  do  that  by  selecting   one of these options,

so  I  believe  it's  under  distributional.

Now, what I'm going to do is I'm going to go ahead and just plot this data.

I'm  going  to  turn the  histogram  on  its  side.

I'm  actually  going  to  invoke   the  local data filter.

I'm going to bring in that rank column, but I'm going to make it ordinal first,

so  that  I  can  select  data  individually

rather  than  under  the  assumption of  the  continuous  distribution.

I'm  going  to  select  everything.

Now,   let's  see, if  I  go  back  to  the  data  table,

I  know  that  8  represents the  highest,  the  largest  value.

I'll  keep  8  in  there,

and  I'll  just  start  reducing   some  of  the lower values

by  holding  down   my  control  key  and  clicking  that,

which  will  effectively  remove  that  point dynamically  from  this  analysis.

I  got  the  control  key  down   and  clicked  again,  click again.

You  saw  it  there, that one outlier at the low side,

anyway,  in  this  case,   just  disappeared.

There's  a  relationship  among  the  distance  between the  fences  and  the  points,

which  is  calculated   on  the  basis  of  the  data

where the median  and the quartiles  are  calculated

based  on  the  data   that's  in  the  analysis.

This  gives  you  a  better  means of  appreciating  how  the   Box Plot

is changing as  a  function  of  data that's either  in  or  out  of  the  analysis.

This  is  a  really  super  cool  feature

that I really  like  to  use  a  lot  in  many  contexts.

What's  the  advantage of a Robust Fit Outlier  algorithm,

which  is  a  JMP  16  algorithm?

It  gives  us  another  means   of  detecting  outlyingness .

We  have  the  ability   to  use  a  Cauchy  method

which  often  avoids   the  impact of skewness,

which  can  be  useful for  practical  situations.

We can also use a 3-sigma or a K-sigma multiplier

in order to help detect outlyingness.

All  of  these  methods  really  help  us

separate potential outliers from real outliers

and help us create a reasonable signal detection methodology

in a similar way that we might  do   if  we  were  to  use  control  charting

and build a control chart with  limits

for  our  particular  experimental   or  manufacturing  application.
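To give a feel for a robust K-sigma rule, the sketch below swaps in a simpler technique than the one in JMP: it uses the median as the center and the scaled median absolute deviation (MAD) as the robust sigma. This is not the Cauchy or Huber fit discussed above, only an assumption-laden stand-in, and the data values are hypothetical.

```python
import statistics


def mad_outliers(values, k=3.0):
    """Flag points more than k robust sigmas from the median (median/MAD rule)."""
    center = statistics.median(values)
    deviations = [abs(x - center) for x in values]
    mad = statistics.median(deviations)
    # 1.4826 rescales the MAD so it estimates sigma for normal data.
    robust_sigma = 1.4826 * mad
    if robust_sigma == 0:
        return []  # no spread to compare against
    return [x for x in values if abs(x - center) > k * robust_sigma]


# Hypothetical readings with one suspicious value.
readings = [9.8, 9.9, 10.0, 10.0, 10.1, 10.2, 10.3, 14.0]
print(mad_outliers(readings, k=3.0))
```

Because the median and MAD barely move when one extreme value is added, the limits are not inflated by the very point being tested, which is the general motivation for robust outlier rules.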

13.  Can  we  include  the  sample  size information  in  the  Box  Plot?

Well, this  is  where  the  Box  Plot  starts to  present  a  clear  limitation.

There  isn't  any  sample  size  information   explicitly  in  the   Box Plot.

Although, we do have the ability in Graph Builder

to create a notched Box Plot, which gives you something

like a  confidence  interval   on  the  median,

the  edges  indicate   a  confidence  interval  on the  median.

We  also  have  the  ability  in  graph  builder to invoke the caption box

which is a very useful  feature   for  summarization  of  data

graphically  without  needing  to  provide an  additional  tabular  data  output.

But  of  course,  that  information  is completely  hidden  to  the   Box Plot  itself.
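One common convention for the notch is median +/- 1.57 x IQR / sqrt(n). The sketch below uses that convention to show how sample size enters; whether Graph Builder computes its notches exactly this way is an assumption, and the median and IQR values are hypothetical.

```python
import math


def notch_interval(median, iqr, n, factor=1.57):
    """Approximate interval on the median: median +/- factor * IQR / sqrt(n)."""
    half_width = factor * iqr / math.sqrt(n)
    return median - half_width, median + half_width


# Same hypothetical median and IQR, two different sample sizes.
small = notch_interval(median=0.0, iqr=1.35, n=25)
large = notch_interval(median=0.0, iqr=1.35, n=100)
print(small)
print(large)
```

Two box plots with identical boxes can therefore carry very different uncertainty on the median: quadrupling the sample size halves the notch width, and that difference is invisible in a plain box plot.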

Connected  to  that  is, can  we  make  any  decision

with any level of  statistical  confidence if  we're  just  looking  at  the   Box Plot?

The  answer  is  no.

In  this  particular  example, we actually designed it

so  the  medians were  slightly  different  on  average.

And  so  we're  getting  some  separation among  the  medians  between  the  groups.

We used Fit Y by X in this context.

What  this  shows  is  that  the  mean [inaudible 00:29:27]

represents the  mean, the mean diamonds  are  non- overlapping.

It  looks  like  all  across   all  four groups being compared,

which  indicates  that  there's  some evidence

that  there's  a  difference in  the  means  between  the  groups.

We  can  also  see   the  difference   in  the  medians.

We can do a nonparametric test.

In this case, we're using a nonparametric Steel test with control,

where  the  control  is  just  the  Z  normal.

We're  seeing  some   evidence of separation,

statistical  separation  among  the  medians  in this particular instance.

It's  hard  for  us  to  detect  that and  see that in the Box Plot.

In  fact,  it  really  isn't   that  clear  at  all.

How can we tell if we have any concern with respect to Kurtosis?

What's  Kurtosis?

Kurtosis is  basically  the  idea  that  if it were a  positive  Kurtosis,

you  would  have  data   that's  concentrated  in  the  middle,

your  data  that's  squished  together   into  the  middle  of  the  distribution.

That's this example on the right.

If  you  had  an  idealized  case of extreme negative  Kurtosis,

you'd  have  a  uniform  distribution where  the  data  is  really  spread  out.

What  you  can  see  in  these  graphs relative  to  the  normal  distribution,

is  that  the  50 %  dense  zone, indicated  by  this  red  bar,

is  basically  about  as  long  as  the distance between  Q1  and  Q3  here,

but it's on one side of the median

in the uniform case.

It's about as long as this box width, and it's also on one side of the median.

That's  a  unique  characteristic  feature of  this  uniform  distribution  shape.

If we look at the peaked situation,

we see that the box width is much more compressed

and the shortest half width is also about the same as the box width;

the shortest half, the most dense region, is, rather, centered about the median.

That's  similar  to  what  you  would  see for the normal distribution case,

where  the  50 %  dense  region   will  be  about centered

on  the  mean  or  the  median and  about  the  same  width  as  the  box.

Clearly,  the  differentiator  here  is that the  distances  are  reduced  quite  a  bit.

Really  the  takeaway  here,  though,

is  that  this  type  of  interpretation is  really  difficult.

And  it  would  be  easier  for  us  to rely on the  shape that's evinced by the histogram

than  to  try  to  look  at  the   Box Plots  separately.
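The "50% dense zone" (shortest half) can be computed by sliding a window that covers half of the sorted points and keeping the narrowest one. The sketch below is an illustration with a hypothetical function name and made-up right-skewed values; tie-breaking and the exact half size may differ from JMP.

```python
import math
import statistics


def shortest_half(values):
    """Return (low, high) of the narrowest window holding half of the points."""
    data = sorted(values)
    n = len(data)
    h = math.ceil(n / 2)  # number of points in the half
    # Index of the window start with the smallest width.
    start = min(range(n - h + 1), key=lambda i: data[i + h - 1] - data[i])
    return data[start], data[start + h - 1]


# Hypothetical right-skewed values.
skewed = [1, 1.2, 1.4, 1.5, 1.7, 2, 3, 5, 8, 13]
low, high = shortest_half(skewed)
print(low, high, statistics.median(skewed))
```

Here the densest half runs from 1 to 1.7 while the median is 1.85, so the shortest half sits entirely on one side of the median, the kind of offset the red bar is meant to reveal.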

Question  16  is  very  similar to 15.

What  about  in  the  context  of  data that  has  more  than  one  mode?

What  about  a  bimodal  distribution?

Well,  I  just  took  the  Box Plots   and  pulled  them  out on the left,

they're  from  the  pictures  on  the  right.

We  can't  really  see  a  whole  lot of  difference  among  these.

It's  difficult  for  us  to  interpret  this.

But  once  we  put  the  histograms, we can see clearly  if  we  fit  a  two- peak distribution

that  there's  two  modes  in  this  data,   and  there's  maybe  one  mode,

maybe  a  small  mode,  but  really  essentially one  mode  in  the  data  on  the  left.

The   Box Plot  isn't  particularly good

at  detecting  that  presence  of  multiple  modes.
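A quick simulation shows why the histogram exposes what the box plot hides. The sketch below builds a hypothetical two-mode sample (the means, spread, seed, and bin edges are arbitrary choices, not the ones used in the project) and prints a text histogram of the bin counts.

```python
import bisect
import random


def histogram_counts(values, edges):
    """Count values per bin; bins are [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        i = bisect.bisect_right(edges, x) - 1
        if 0 <= i < len(counts):  # ignore values outside the edges
            counts[i] += 1
    return counts


rng = random.Random(16)  # fixed seed so the sketch is repeatable
bimodal = ([rng.gauss(-2, 0.5) for _ in range(200)]
           + [rng.gauss(2, 0.5) for _ in range(200)])

edges = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
counts = histogram_counts(bimodal, edges)

for left, count in zip(edges, counts):
    print(f"{left:>3} to {left + 1:>3} | {'#' * (count // 4)}")
```

The bins around zero are nearly empty while the bins around -2 and 2 are full. A box plot of the same sample would show one wide box spanning the valley, with no hint of the two modes.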

The  last  question  is,  how  many   "normality violation failure modes"

can  we  detect  with  the   Box Plot?

This  question  brings all  the  other  ones  together.

Well,  if  we  have  skewness,

we've  shown  that  we  have a  strong  ability  to  detect  that.

If  we  have  potential  outliers,

we  definitely  have   a  strong  ability  to  detect  that.

If  we  have  Kurtosis,

which is really related to the shape, as is the case if there are multiple modes,

then  we  really  don't  have a  strong  ability  to  detect  that.

If  we're  considering   hypothesis  testing,

we definitely don't have an  ability to detect that  either  with  the   Box Plot.

What are the takeaways?

Well,  the   Box Plot  is  definitely  a powerful visualization tool.

It's  a  great  introductory  tool,

and it  has  a  wonderful  ability   to  separate skewness

from potential outlyingness.

But  it  has  its  limitations.

In  cases  where  we're  looking at  Kurtotic  shape  or  a  bimodality

or multimodality,   the  histogram is  definitely  a  better  choice.

That's  really  probably  why  JMP  uses both the Histogram and the Box Plot together

in  the  distribution  platform to visualize how the data is behaving,

if  you  will.

Of  course,  adding  descriptive  statistics

helps us really round out the picture where we have a graphical first approach.

This  is  just,  again,  a  summarization   of  what  we've  discussed.

But the  last  couple  of  minutes,

I  just  want  to  show  you  a  couple  more things  about  the  data  set  itself.

Because  I  think  this  is  perhaps the  most  useful  aspect  of  the  project.

How  might  we  set  up  a  data  set  like  this?

All  we  really  have  to  do   to  simulate  data in JMP

is  just  create  some  rows   and then create  a  function,

a random  normal  function.

The  process  that  we  did, one way you  could  do  this  is  you  could say,

okay,  you  can  go  into  a  column  formula…

Let me  just  show  you  this.

You  can  just  double- click  into  it, and  you  can  click  Formula,

you  can  edit  the  formula.

You  can  go  over  here   to  these  random functions,

you can click it in, and then you can specify a population mean and sigma.

Zero  and  one  by  default,  click  OK,   and  then  I  can  add  a  bunch  of  rows.

I'll  go  ahead  and  add  100  rows.

What  about  these   other  distributions?

For a uniform distribution, we can use the random uniform function.

And  then  we  can  specify a  Min  and  Max  value.

In  this  case,

I  just  specified  the  minimum of this  column,

this  normally  distributed  data  column, and  the  Max is  the  maximum.

And  then  finally,   as  I  mentioned,

I  standardized  the  column  so  that  it was  on  the  same  numeric  scale.

This column standardize step

is common to all of these columns.
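The setup just described (a random normal column, a uniform column bounded by the normal column's minimum and maximum, then standardizing) can be mimicked outside JMP. The sketch below is a Python stand-in for the JMP column formulas, with a hypothetical seed and variable names.

```python
import random
import statistics


def standardize(column):
    """Center on the sample mean and scale by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


rng = random.Random(2022)  # fixed seed so the sketch is repeatable
n_rows = 100

# Normal column: population mean 0, sigma 1.
normal = [rng.gauss(0, 1) for _ in range(n_rows)]

# Uniform column: bounded by the minimum and maximum of the normal column.
uniform = [rng.uniform(min(normal), max(normal)) for _ in range(n_rows)]

z_normal = standardize(normal)
z_uniform = standardize(uniform)

print(round(statistics.fmean(z_uniform), 6), round(statistics.stdev(z_uniform), 6))
```

After standardizing, every column has a sample mean of zero and a sample standard deviation of one, so differences seen in the box plots come from shape rather than from location or scale.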

Now,  the  last  thing  I  want   to  talk about real quick

is, well, what about peaked?

What about right skewed, and even bimodal?

Well,  one  of  the  things  we  can  do, which I really think is cool,

is  we  can  use   the  distribution  calculator in JMP

to  help  us   understand what certain  distribution  types  look  like.

I'm just going to go into it here. I'm going to just drill down in here.

I'll  share  with  you   the  location  here  of  this  script.

It's  going  to  be  under  Calculator. It's not.

It's  going  to  be  under Distribution.

Generator.

Distribution  calculator,   on the calculators,  yes.

How  might  I  create  a  distribution  that's  right  skewed?

Which  random  function   would  I  use?

Well,  I  have  the  ability  to  look   at some of  these  distributions and see

for  example,

if  I  specify  a  random  F   and  I  specify these parameters,

then  I'm  getting  a  distribution   with  this  kind of skewness.

And  then  I  can  say,  well,

what  happens  if  I  change   these  parameters  a  little  bit?

How  is  that  going   to  change  the  distribution?

I  can  use  this  insight   to  specify the parameters

for  the  random distributions that  I  specify  in  my  data  set.

In  fact,  that's  what  I  did  here.

What  did  I  do  for  the  peaked  one?

Well,  if  I  look  at  the  T  distribution, and  I  reduce  the  degrees  of  freedom,

I'm going to get a distribution that's relatively peaked.

I'm  going  to  see  a  positive  Kurtosis  in  that.

That's  one  way  I  can  understand   the shape of these distributions

so  that  I  can  use them  to my advantage

to do different what-if analyses in JMP.
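The same what-if exploration can be sketched numerically: draw from a t distribution with few degrees of freedom and from an F distribution, then check the sample kurtosis and skewness. The degrees of freedom, seed, and function names below are hypothetical choices for illustration, not the parameters used in the project's data table.

```python
import random
import statistics


def random_t(rng, df):
    """Student t draw: standard normal over sqrt(chi-square / df)."""
    chi_square = rng.gammavariate(df / 2, 2)  # chi-square with df degrees of freedom
    return rng.gauss(0, 1) / (chi_square / df) ** 0.5


def random_f(rng, df1, df2):
    """F draw: ratio of two chi-squares, each divided by its degrees of freedom."""
    numerator = rng.gammavariate(df1 / 2, 2) / df1
    denominator = rng.gammavariate(df2 / 2, 2) / df2
    return numerator / denominator


def skewness(data):
    """Moment-based sample skewness."""
    mean = statistics.fmean(data)
    m2 = statistics.fmean((x - mean) ** 2 for x in data)
    m3 = statistics.fmean((x - mean) ** 3 for x in data)
    return m3 / m2 ** 1.5


def excess_kurtosis(data):
    """Moment-based sample kurtosis minus 3 (zero for a normal distribution)."""
    mean = statistics.fmean(data)
    m2 = statistics.fmean((x - mean) ** 2 for x in data)
    m4 = statistics.fmean((x - mean) ** 4 for x in data)
    return m4 / m2 ** 2 - 3


rng = random.Random(17)  # fixed seed so the sketch is repeatable
peaked = [random_t(rng, df=5) for _ in range(2000)]
right_skewed = [random_f(rng, df1=5, df2=20) for _ in range(2000)]

print("t(5) excess kurtosis:", round(excess_kurtosis(peaked), 2))
print("F(5, 20) skewness:   ", round(skewness(right_skewed), 2))
```

Lowering the t degrees of freedom or changing the F parameters and rerunning gives a quick numeric counterpart to dragging the parameters in the Distribution Calculator.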

I'm  just  going  to  quickly go  back  to  my  slides.

Thank  you  very  much  for  listening.

If  you  have  any  questions,

I  look  forward  to  receiving  them   on  the  user  community.

As I  mentioned, this  project  will  be  posted  there,

and  the  summary  abstract is posted  at  this  link  here.

Thank  you again.

Comments

Thank you in advance to those of you who will take the time to review and engage with this presentation!  

We would like to share some information which we think is useful for working with JMP in the context of this project. 

 

1) JMP Distribution and Probability Calculator.  This is extremely useful for understanding the shapes of population distributions, for the purpose of applying the correct "transformation" to the normally distributed data set (for simulation of a random sample using column formulas in JMP).

 

Where to find this? 


(JMP 16)

Help > Sample Data

Under Teaching Resources > Teaching Scripts > Interactive Teaching Modules > Distribution Calculator

 

(JMP 17)

Help > Sample Index

Under Teaching Resources > Teaching Scripts > Calculators > Distribution Calculator

 

2) Calculation of Percentiles for Boxplot [potential] Outlier Determination in JMP.  This information is helpful for understanding how JMP is calculating the potential outliers in a "simple [outlier] boxplot," which you can generate under the Distribution Platform.  JMP uses a weighted average aimed method, specifically in accordance with SAS's UNIVARIATE procedure, using percentile definition (PCTLDEF) = 4. 

 

For more information see: https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/procstat/procstat_univariate_details14.htm.

I would like to share credit with my colleague @AdamMorris for his thoughtful contributions in driving towards a more  in-depth practical understanding of BoxPlot calculations in JMP.

 

P.S. A kind introduction to the box plot is also offered by the JMP Statistics Knowledge Portal (SKP):

https://www.jmp.com/en_be/statistics-knowledge-portal/exploratory-data-analysis/box-plot.html

johnbell

Thanks Patrick, nice job!

Thanks @johnbell for your support! Hope you are well.