Re: data splitting

tuo88138 · Dec 12, 2022 12:18 PM

I want to feed data to JMP as three sets: training, validation, and test. How can I do this? Does JMP use the test set, or just training and validation?

P_Bartell · Dec 12, 2022 12:33 PM

JMP Pro has a native capability to designate any observation as training, validation, or test. The analysis platforms then handle the observations appropriately. If you only have standard JMP, you can still follow this modeling best practice; it just takes many more keystrokes and menu selections, with a lot of manual data handling. I think my former colleague (I'm a SAS/JMP retiree) @Jeff_Perkinson wrote a blog post many years ago with suggestions for a workflow in standard JMP. Or maybe I'm misremembering.

Here's a link on how to make a validation column in JMP and JMP Pro: Creating a validation column

tuo88138 · Dec 20, 2022 12:31 PM

Thank you so much.

Bill_Worley · Dec 12, 2022 12:55 PM

Hello @tuo88138,

Welcome to the JMP Community!

Standard JMP uses 2-level validation (training and validation) as a rule, and JMP Pro has 3-level validation (training, validation, and test) built in under Analyze > Predictive Modeling. In standard JMP, you can always hide and exclude your test set data when building your model, then bring it back in to see how the model performs on it and how well the training and validation sets did in the modeling process.

You can also build your own validation columns manually in JMP Pro. Create a new column, go to Column Info, and open the Initialize Data > Missing/Empty drop-down. Select Random and then Random Indicator to break the data up into 2- or 3-level validation sets. To see how to build a validation column in JMP Pro, check out the link below.

Creating a Validation Column (Holdout Sample) | JMP
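
For anyone who wants to see the idea behind a random indicator column outside of JMP, here is a minimal Python sketch. It is not JMP's implementation, just an illustration: the function name make_validation_column, the 60/20/20 proportions, and the 0/1/2 coding are all assumptions made for this example.

```python
import random

def make_validation_column(n_rows, proportions=(0.6, 0.2, 0.2), seed=0):
    """Return one label per row: 0 = training, 1 = validation, 2 = test.

    Hypothetical helper for illustration only. Pass two proportions
    (e.g. (0.75, 0.25)) for a 2-level training/validation split.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    levels = list(range(len(proportions)))
    # Each row's label is drawn independently, weighted by the proportions,
    # so the realized group sizes are only approximately 60/20/20.
    return rng.choices(levels, weights=proportions, k=n_rows)

labels = make_validation_column(1000)
names = {0: "Training", 1: "Validation", 2: "Test"}
counts = {names[level]: labels.count(level) for level in names}
print(counts)
```

Because each row is drawn independently, the group sizes vary a little from run to run unless the seed is fixed. If exact group sizes matter, shuffling the row indices and slicing them is a common alternative.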

HTH

Bill

P_Bartell · Dec 14, 2022 01:38 PM

Just to add a bit to both my initial reply and @Bill_Worley's: I'm not sure if you are aware, but JMP also supports other cross-validation methods such as k-fold and leave-one-out. These methods are most commonly set up in the analysis platform's launch/specification workflow rather than through specific columns or train/validate/test designations for observations. So depending on the modeling method, the practical problem at hand, and the data, one of these methods might be useful as well. Like so much in statistics, 'it depends'.
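
To make the k-fold and leave-one-out ideas concrete, here is a small Python sketch of how the rows get partitioned. It only illustrates the general technique and says nothing about how JMP implements it internally; the function name k_fold_splits and its parameters are hypothetical.

```python
import random

def k_fold_splits(n_rows, k=5, seed=0):
    """Yield (training_rows, validation_rows) index lists for each of k folds.

    Hypothetical helper for illustration. Setting k equal to n_rows
    gives leave-one-out cross-validation.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)  # randomize row order before dealing into folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        # Every row outside the held-out fold is used for training.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation

five_fold = list(k_fold_splits(10, k=5))
leave_one_out = list(k_fold_splits(10, k=10))
for training, validation in five_fold:
    print(sorted(validation), "held out;", len(training), "rows used for training")
```

Each row is held out exactly once, so every observation contributes to both fitting and validation across the k runs. That is why these methods are often attractive for small data sets where a fixed train/validation/test split would leave too few rows in each group.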