Solved: Re: should i use principal component analysis or k-means cluster analysis?

learning_JSL · Jun 17, 2023 03:39 PM

Hi - I am trying to decide the best method of cluster analysis (e.g. principal component analysis, k-means, etc.) to use for the following situation. I have a mapped dataset with 12,928 records, each corresponding to a well with sample results. Each row of data (each location on my map) has a well name, latitude, longitude, and results for compound A, compound B, compound C, compound D, compound E, etc. (8 chemical compounds in all). These wells have been contaminated by one of three sources: 1) air deposition, 2) process waste, or 3) a combination of the two (mixed). Each source is associated with a unique source signature (e.g. the air deposition source tends to have high compound X and low compound Y, while process waste tends to have high compound Y and Z and low compound X). So, each row (i.e. well location) of my dataset is associated with one of the three sources of contamination. My goal is to identify which source is most likely for each record (i.e. well location) in my dataset.

Importantly, a subset of ~800 records in my dataset is known to be associated with the air deposition source signature. As such, this subset of data can be used as a training set for the air deposition signature. I can also come up with a subset of ~50 records that are representative of the process waste source signature.

Any suggested approaches in JMP or JMP Pro would be greatly appreciated. Thanks in advance!

learning_JSL · Jun 17, 2023 03:40 PM

See the attached Excel spreadsheet for a sample of the data.

P_Bartell · Jun 18, 2023 07:20 AM

I haven't looked at the Excel spreadsheet, but I'm not sure that exploratory data analysis methods such as clustering, or dimensionality reduction methods such as PCA, are best suited to your practical problem. My interpretation of your original post is based largely on the last sentence of your first paragraph: "My goal is to identify which source is most likely for each record...". To me this sounds like a classification modeling problem where contamination source is the dependent variable and the compounds are your predictor variables. If I'm on the right track, there are any number of modeling methods in JMP and JMP Pro suited to classification problems. You could start with nominal logistic regression, try any number of tree methods, or maybe the PLS-Discriminant Analysis platform in JMP Pro, to name just a few. If you have JMP Pro and your data support cross validation methods, I recommend you include those in your workflow as well. You are already hinting at this with the training subset for one of the sources, so it sounds like you are amenable to those methods?
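
To make the classification-plus-cross-validation idea concrete outside JMP, here is a minimal Python sketch. It is not nominal logistic regression, a tree method, or PLS-DA: a simple nearest-centroid classifier stands in for them, scored with leave-one-out cross validation. The two-column profiles (share of compound A, share of compound B) and the labels are made-up values for illustration only.

```python
import math

def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit(X, y):
    """Return one centroid per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in groups.items()}

def predict(model, row):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], row))

def leave_one_out_accuracy(X, y):
    """Hold out each row in turn, fit on the rest, and score the held-out row."""
    hits = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

# Hypothetical labeled wells: [share of compound A, share of compound B]
X = [[0.80, 0.10], [0.75, 0.15], [0.85, 0.05],
     [0.10, 0.80], [0.15, 0.75], [0.05, 0.85],
     [0.45, 0.45], [0.50, 0.40], [0.40, 0.50]]
y = ["air"] * 3 + ["waste"] * 3 + ["mixed"] * 3

accuracy = leave_one_out_accuracy(X, y)
new_well_source = predict(fit(X, y), [0.70, 0.20])
```

Note that the sketch needs labeled examples of all three sources; without them there is nothing to fit or validate against.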

learning_JSL · Jun 19, 2023 02:27 PM

Hi P_Bartell - thanks very much for your reply. I'll try to clarify my objective. Each record represents a mapped well sample showing the results (concentrations) of many different PFAS compounds (each compound is a different field). Three sources of contamination are possible in my study area: (1) an air deposition source, which presumably is associated predominantly with the compound "PMPA", (2) a process waste source, which presumably is associated with "PFMOAA", and (3) a mixture of both air deposition and process waste. Of course, the world is not so neat and tidy, so both of those compounds (PMPA and PFMOAA) will occur in nearly all of the wells, regardless of their proximity to either source, plus other compounds of course. It's the relative concentrations of the compounds that will help identify whether the well is impacted by one source or the others. And it may be that other compounds are better surrogates or indicators?....tbd.

My objective is to decide which areas (i.e. wells) are contaminated by which source. (The spreadsheet I attached is just a sample of the data but gives you an idea of my data format.) Importantly, wells farther away from a source will naturally have decreasing concentrations of the different compounds found in that well, but the relative ratios and compounds associated with that source should continue to hold. So a well's location is part of the story (i.e. where a well is can affect its compound makeup).

So, given the above, if you still think that discriminant analysis (DA), for example, is a good way to go, I have a couple of questions. (1) Aren't my samples (well results) supposed to be independent? And based on my description above, would they be? It seems that they are not truly independent, as wells close to one source would be more like each other than like wells located close to a different source. Thoughts? (2) The concentrations of each of my PFAS compounds (independent variables) are supposed to be normally distributed. Given that these are environmental contaminant data, this assumption is often not met, sometimes even after log transformation. Is this a problem? (3) It seems that my predictor variables (PFAS compounds) would naturally be collinear. In other words, a given source would be associated with a few compounds that rise and fall together, depending on proximity to that source. Is this a problem? Thanks in advance!
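
The point above that relative ratios should hold even as concentrations fall with distance can be sketched in a few lines of Python (outside JMP). The idea is to convert each well's raw concentrations into shares of the row total, so two wells with the same signature but different dilution get the same profile. The well values and the "other" column are hypothetical, and dividing by the row total is only one possible normalization.

```python
def to_fractions(concentrations):
    """Convert raw concentrations to each compound's share of the row total."""
    total = sum(concentrations.values())
    if total == 0:
        # A well with no detections has no meaningful profile
        return {name: 0.0 for name in concentrations}
    return {name: value / total for name, value in concentrations.items()}

# Hypothetical wells with the same signature; the far well is 10x more dilute
near_well = {"PMPA": 800.0, "PFMOAA": 150.0, "other": 50.0}
far_well = {"PMPA": 80.0, "PFMOAA": 15.0, "other": 5.0}

near_profile = to_fractions(near_well)
far_profile = to_fractions(far_well)
```

Both profiles come out as roughly 80% PMPA, 15% PFMOAA, and 5% other, so any downstream method sees the signature rather than the distance from the source.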

P_Bartell · Jun 19, 2023 03:00 PM

I'll do my best to answer each of your questions.

(1) Not necessarily. 'Independence' is one of those great urban myths about modeling requirements, like normality (see my comments under (2)); it is not required for a multitude of modeling methods.

(2) Says who? Real-life data are often not normally distributed with respect to the independent variables...who cares?

(3) This is where a method such as PLS - Discriminant Analysis shines. At the risk of oversimplifying just a bit: in the background of all PLS-based methods (which shine with collinear predictor variables), some very elegant math turns those collinear predictor variables into what are called 'latent variables', which have some really nice properties. It's very analogous to doing principal component analysis on the predictors BEFORE modeling.

A long time ago, whilst I was still a JMP Senior Systems Engineer (I'm retired nowadays), I put together a Mastering JMP webinar that is still available. It's entitled "Using Partial Least Squares. When Ordinary Least Squares Regression Just Won't Work." You may want to watch it. Here's a link: P. Bartell - PLS. I actually cover a classification-type problem not too dissimilar to yours at the very end of the event. It uses human genome data (not normally distributed by any stretch) to classify women's estrogen receptor status (a binary response, but easily extensible to your situation with 3 possible classes).

Lastly, if you've got JMP Pro, I'd also take a serious look at the various tree methods. Then keep all the models in the Formula Depot and compare how they perform. I think my webinar was done using JMP Pro v 13, so things may have changed a bit with respect to the capabilities and functionality I show, since it looks like you might be running v 17, but the basic ideas still hold water. Good luck!
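
As a rough illustration of the "principal components on the predictors" analogy, the Python sketch below finds the leading principal component of three made-up compound columns, two of which rise and fall together. This is plain PCA via power iteration, not PLS: PLS also uses the response when it builds its latent variables. The data and function name are assumptions for illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the leading component."""
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the predictors
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    vec = [1.0] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return vec, eigenvalue / total_variance

# Hypothetical wells: columns 1 and 2 are nearly collinear, column 3 barely varies
wells = [[1.0, 2.1, 0.5],
         [2.0, 3.9, 0.4],
         [3.0, 6.2, 0.6],
         [4.0, 7.8, 0.5],
         [5.0, 10.1, 0.5]]

direction, explained = first_principal_component(wells)
```

With predictors this collinear, a single component carries almost all of the variance, which is the sense in which a few latent variables can replace many correlated compound columns.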

learning_JSL · Jun 19, 2023 05:43 PM

Thanks P_Bartell! I'll check out your link and do a little homework. I appreciate your detailed responses!

learning_JSL · Jun 19, 2023 08:44 PM

I watched your videos - very helpful! Assuming I use Discriminant Analysis in JMP, my Y covariates are obviously all my numerous PFAS compound fields. But it is unclear how my X category/class field - "source" - is supposed to work. I say this because I do not know what the source is for the vast majority of my dataset (which is why I'm doing this analysis).

Recall that my dataset looks like this:

I can add a source field to my dataset and populate it for about 10% of my rows (~1300 rows) because of the unique attributes of my study area. However, I can populate it only for the "air deposition" source. The other two sources ("process waste" and "mixed") are not easy to discern... thus my need to analyze my dataset (PLS for example). If pressed, I could probably populate the "process waste" source also, but only for about 30 rows, not more. The source field for the other 11,000 rows is unknown.

Am I misreading the workflow here?

P_Bartell · Jun 20, 2023 07:08 AM

No, you aren't misreading the workflow. If you can't populate the Loc column in the data table with all three possible levels of that variable, then you can't build a model, with PLS or any other method for that matter. So I'm going to suggest you go back to your original thoughts about multivariate exploratory data analysis tools, like clustering or principal components analysis, applied to the compound variables. At best, if some clusters arise or relatively few principal components emerge, you'll be able to see that structure among the compound variables.

Bill_Worley · Jun 20, 2023 09:11 AM

I am going to pile on to my old friend @P_Bartell's answer. In my opinion only, clustering sounds like a great place to start for your analytic needs. K-Means or Hierarchical clustering should work well, and since you have 3 possible sources of contamination, you should limit your K-Means or Hierarchical clustering to 3 clusters. When you are satisfied with the results, save the clusters back to the data table; this column can then be used as part of any further modeling you do. You can also save the cluster formula back to the data table, so that when you enter any new data into the table you will get an immediate prediction of the cluster (source) of the contamination.
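
For readers who want to see the mechanics outside JMP, here is a minimal Python sketch of the workflow described above: run k-means with 3 clusters, save the cluster label for each row, and reuse the fitted centroids to score a new well. It is not JMP's implementation. The profiles are made-up shares of PMPA and PFMOAA, and the starting centroids are fixed by hand so the run is repeatable; real k-means runs usually try several random starts.

```python
import math

def assign(centroids, row):
    """Index of the nearest centroid (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(centroids[k], row))

def k_means(rows, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid updates until centroids stop moving."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = [assign(centroids, row) for row in rows]
        updated = []
        for k, old in enumerate(centroids):
            members = [row for row, label in zip(rows, labels) if label == k]
            # Keep the old centroid if a cluster ends up empty
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    labels = [assign(centroids, row) for row in rows]
    return centroids, labels

# Hypothetical wells: [share of PMPA, share of PFMOAA]
wells = [[0.80, 0.10], [0.75, 0.15], [0.85, 0.05],
         [0.10, 0.80], [0.15, 0.75], [0.05, 0.85],
         [0.45, 0.45], [0.50, 0.40], [0.40, 0.50]]

# K = 3 because three sources are possible; starts chosen by hand for repeatability
centroids, cluster_column = k_means(wells, [wells[0], wells[3], wells[6]])

# Equivalent of the saved cluster formula: score a new well with the fitted centroids
new_well_cluster = assign(centroids, [0.70, 0.20])
```

The cluster numbers themselves carry no meaning. Deciding which cluster corresponds to air deposition, process waste, or mixed would still rely on the wells whose source is already known.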

learning_JSL · Jun 20, 2023 03:25 PM

Thank you both. I'll start there (cluster analysis) and see where it leads. I appreciate the input!