Solved: Efficient DOE of one multi-level (3+) categorical variable and many continuous variables

AsymptoticRules · Sep 3, 2023 02:13 PM

I want to conduct a screening experiment to identify which factors affect my response (Y) the most. From past research, I am aware that these variables have interactions, so I need to have interaction terms in my model. The factors I identified as relevant are the following:

X1 - continuous
X2 - continuous
X3 - continuous
X4 - continuous
X5 - continuous
X6 - categorical with 2 levels only
X7 - categorical with 10 levels

To be specific, X7 is a type of chemical: a fatty acid. Fatty acids can be saturated, monounsaturated, or polyunsaturated, longer chain or shorter chain, etc., so I selected 10 of them that span this range of properties. Because X7 has more than 3 levels, I am forced to use Custom Design in JMP, and once I account for the interaction terms in my model, this results in about 42 experiments. Is there a more efficient way to do this? Specifically, is there a more efficient way to set up a DOE to study categorical variables like X7?

Thinking out loud, at first I thought that converting X7 from a discrete to a continuous variable (perhaps by focusing on one important feature of fatty acids) could be helpful, because then I could simply pick a "high" and a "low" setting. But chemicals are multidimensional, i.e. they cannot be boiled down to a single feature or number, so I do not believe this is the way forward.

So is there a more efficient way to design a study where you have a categorical variable like X7?

Victor_G · Sep 3, 2023 03:38 PM

Hi @AsymptoticRules,

Welcome to the Community!

Using chemical structures as categorical factors comes with several drawbacks:

You may be able to analyze and select which molecule(s) perform best, but you will not always understand why certain molecules perform best (polarity, hydrophilic/lipophilic behaviour, molecular volume, number of atoms, H-donors/acceptors, ...).
You can analyze the response for the molecular structures used in the design, but you cannot predict the response for new molecules that were not part of the experimental design.
A categorical factor with so many levels comes at the price of many possible combinations with the other factors, which creates a design with a high number of experiments (not very convenient at an early stage such as a screening phase).
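
To make the cost of a many-level categorical factor concrete, here is a minimal Python sketch that counts the parameters in a model with an intercept, all main effects, and all two-factor interactions for the seven factors in the question. The function and variable names are illustrative assumptions, and the full two-factor-interaction model counted here may differ from the model actually specified in Custom Design. A continuous or 2-level factor contributes 1 parameter, a k-level categorical factor contributes k - 1, and an interaction contributes the product of the two factors' counts.

```python
from itertools import combinations


def degrees_of_freedom(kind):
    """1 for a continuous factor, (levels - 1) for a categorical factor."""
    return 1 if kind == "continuous" else kind - 1


def count_parameters(factors):
    """Intercept + main effects + all two-factor interactions."""
    dfs = [degrees_of_freedom(kind) for kind in factors.values()]
    main_effects = sum(dfs)
    two_way = sum(a * b for a, b in combinations(dfs, 2))
    return 1 + main_effects + two_way


def factors_with_x7_levels(k):
    """The seven factors from the question, with X7 restricted to k levels."""
    factors = {f"X{i}": "continuous" for i in range(1, 6)}
    factors["X6"] = 2  # categorical, 2 levels
    factors["X7"] = k  # categorical, k levels (10 in the question)
    return factors


for k in (2, 3, 4, 10):
    print(f"X7 with {k:2d} levels: {count_parameters(factors_with_x7_levels(k))} parameters")
```

A design needs at least as many runs as parameters to estimate all of them, so under these assumptions going from 10 levels of X7 (85 parameters) to 4 levels (43 parameters) roughly halves the minimum size of the full model.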

Since you are in a screening phase, it would be worth reducing the number of fatty acid candidates and keeping only the molecules with the highest chemical variability. This reduces the number of experiments and interactions to screen while still letting you detect significant effects and interactions. From there, you can augment the design into an optimization/predictive design in a second step, possibly with other molecule candidates.

To reduce the number of fatty acid candidates, I would analyze the chemical properties/molecular descriptors of the initial 10 fatty acids you plan to screen. Here is how I would do it:

Calculate or extract molecular descriptors from the chemical structures. Several options are available: Python libraries such as RDKit can calculate molecular descriptors, or you can extract them from public databases such as PubChem, ChemSpider, and many others.
Use PCA (or another dimension reduction method) to keep a large part of the chemical information in a small number of dimensions. Chemical properties/molecular descriptors are frequently highly correlated, so you may be able to keep >70% of the chemical information of this chemical class with only one principal component. This step should make it easier to analyze and select molecule candidates.
Plot the molecule candidates on a Parallel Coordinates Plot (or another visualization) using their principal components or raw molecular attributes, so that you can select the most dissimilar molecules (for example, at least a high and a low level for each principal component). You can also simulate a DoE based only on the principal components as covariate factors and see which molecules would be selected in a D-optimal/screening design.
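
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the ten fatty acids are a hypothetical candidate set (the thread does not list the actual ones), the three descriptors (carbon count, number of C=C double bonds, and molecular weight computed from the formula CnH(2n-2d)O2) stand in for a richer descriptor table from RDKit or PubChem, and the first principal component is obtained by power iteration on the correlation matrix rather than with JMP's own PCA platform.

```python
import math
from statistics import mean, stdev

# Hypothetical candidates: (name, carbon atoms, C=C double bonds).
CANDIDATES = [
    ("lauric", 12, 0), ("myristic", 14, 0), ("palmitic", 16, 0),
    ("stearic", 18, 0), ("oleic", 18, 1), ("linoleic", 18, 2),
    ("alpha-linolenic", 18, 3), ("arachidonic", 20, 4),
    ("EPA", 20, 5), ("DHA", 22, 6),
]


def descriptors(carbons, double_bonds):
    """Carbon count, double bonds, and molecular weight of CnH(2n-2d)O2."""
    hydrogens = 2 * carbons - 2 * double_bonds
    mol_weight = 12.011 * carbons + 1.008 * hydrogens + 2 * 15.999
    return [float(carbons), float(double_bonds), mol_weight]


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def first_principal_component(z, iterations=500):
    """Leading eigenvector of the correlation matrix and its variance share."""
    n, p = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * corr[i][j] * v[j]
                     for i in range(p) for j in range(p))
    return v, eigenvalue / p  # the trace of a correlation matrix equals p


names = [name for name, _, _ in CANDIDATES]
z = standardize([descriptors(n, d) for _, n, d in CANDIDATES])
loadings, explained = first_principal_component(z)

# Score of each molecule on PC1, then rank from lowest to highest.
scores = {name: sum(a * b for a, b in zip(row, loadings))
          for name, row in zip(names, z)}
ranked = sorted(scores, key=scores.get)
low, high = ranked[0], ranked[-1]

print(f"PC1 explains {explained:.0%} of the standardized variance")
print("Most dissimilar pair along PC1:", low, "and", high)
```

With only one component, chain length and unsaturation are blended into a single axis, so in a real analysis it would be sensible to inspect further components before settling on the reduced set of levels.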

You can then use these selected molecules as the levels of your categorical factor, or use the principal components directly as continuous factors/covariates in the design.

On this topic, you might find this presentation interesting: https://community.jmp.com/t5/Discovery-Summit-Europe-2017/Increase-Efficiency-and-Model-Applicabilit...

This is only one possible option; I'm sure other members of this forum have different experiences with molecules as factors. Personally, I always try to transform categorical information into continuous information with this type of approach whenever possible.
I hope this helps,
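
As a rough illustration of letting a D-optimality criterion pick the levels, the sketch below exhaustively searches all 4-molecule subsets of a hypothetical candidate list and keeps the one that maximizes det(X'X) for a model with an intercept and two descriptors (carbon count and number of double bonds). The candidate list, the choice of raw descriptors instead of principal components, and the subset size of 4 are all assumptions made for the example. Brute-force search is used here for transparency; it is not how JMP builds its designs.

```python
from itertools import combinations

# Hypothetical candidates: (name, carbon atoms, C=C double bonds).
CANDIDATES = [
    ("lauric", 12, 0), ("myristic", 14, 0), ("palmitic", 16, 0),
    ("stearic", 18, 0), ("oleic", 18, 1), ("linoleic", 18, 2),
    ("alpha-linolenic", 18, 3), ("arachidonic", 20, 4),
    ("EPA", 20, 5), ("DHA", 22, 6),
]


def d_criterion(subset):
    """det(X'X) for the model: intercept + carbons + double bonds."""
    x = [[1.0, float(n), float(d)] for _, n, d in subset]
    m = [[sum(r[i] * r[j] for r in x) for j in range(3)] for i in range(3)]
    # Cofactor expansion of the 3x3 information matrix along its first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Exhaustive search over all C(10, 4) = 210 subsets.
best = max(combinations(CANDIDATES, 4), key=d_criterion)
print("Selected molecules:", [name for name, _, _ in best])
```

A subset made up only of saturated acids gives a determinant of zero, because the double-bond column is constant and its effect cannot be estimated; the criterion therefore favours molecules spread across both chain length and unsaturation.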

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics

AsymptoticRules · Sep 3, 2023 08:18 PM

Thank you for the perspective, Victor! That does make sense to me! I will try to update this post when I give it a shot!

P_Bartell · Sep 4, 2023 07:49 AM

I agree wholeheartedly with @Victor_G's thoughts. The only other thing I can think of to reduce the number of levels for X7 in the experiment is to ask: "Do you have historic observational data (maybe from production runs) where the X7 factor varies across a wide range of values for the underlying chemical properties?" If so, you could apply variable identification modeling methods, such as any of the tree-based methods or, if you have JMP Pro, some of the Generalized Regression platforms that are adept at variable identification. One nice feature of many of these methods is that they are somewhat robust to correlations among the factors compared to techniques like ordinary least squares regression or its kissing cousin for variable identification, stepwise regression. You might be able to screen out some of the X7 categorical levels this way.
