Fault Detection and Diagnosis of the Tennessee Eastman Process Using Multivariate Control Charts (2021-EU-45MP-782)

Level: Intermediate

 

Jeremy Ash, JMP Analytics Software Tester, SAS

 

The Model Driven Multivariate Control Chart (MDMCC) platform enables users to build control charts based on PCA or PLS models. These can be used for fault detection and diagnosis of high dimensional data sets. We demonstrate MDMCC monitoring of a PLS model using the simulation of a real-world industrial chemical process: the Tennessee Eastman Process. During the simulation, quality and process variables are measured as a chemical reactor produces liquid products from gaseous reactants. We demonstrate how MDMCC can perform online monitoring by connecting JMP to an external database. Measuring product quality variables often involves a time delay before measurements are available which can delay fault detection substantially. When MDMCC monitors a PLS model, the variation of product quality variables is monitored as a function of process variables. Since process variables are often more readily available, this can aid in the early detection of faults. We also demonstrate fault diagnosis in an offline setting. This often involves switching between multivariate control charts, univariate control charts and diagnostic plots. MDMCC provides a user-friendly interface to move between these plots.

 

 

Auto-generated transcript...

 



  Hello, I'm Jeremy Ash. I'm a
  statistician in JMP R&D. My job
  primarily consists of testing
  the multivariate statistics
  platforms in JMP, but I also
  help research and evaluate
  methodology. Today I'm going to
  be analyzing the Tennessee
  Eastman process using some
  statistical process control
  methods in JMP. I'm going to
  be paying particular attention
  to the model driven multivariate
  control chart platform, which is
  a new addition to JMP 15.
  These data provide an
  opportunity to showcase a number of the platform's
  features. And just as a quick
  disclaimer, this is similar to
  my Discovery Americas talk. We
  realized that Europe hadn't seen a
  model driven multivariate
  control chart talk due to all the
  craziness around COVID, so I
  decided to focus on the basics.
  But there is some new material
  at the end of the talk. I'll
  briefly cover a few additional
  example analyses that I put on the Community page for the talk.
  First, I'll assume some knowledge
  of statistical process control
  in this talk. The main thing it
  would be helpful to know about
  is control charts. If you're
  not familiar with these, these
  are charts used to monitor
  complex industrial systems to
  determine when they deviate
  from normal operating
  conditions.
  I'm not gonna have much time to
  go into the methodology of model
  driven multivariate control
  chart, so I'll refer to these other
  great talks that are freely
  available on the JMP Community
  if you want more details. I
  should also say that Jianfeng
  Ding was the primary
  developer of the model driven
  multivariate control
  chart in collaboration with
  Chris Gotwalt and that Tonya
  Mauldin and I were testers. The
  focus of this talk will be using
  multivariate control charts to
  monitor a real-world chemical process; another novel
  aspect will be using control
  charts for online process
  monitoring. This means we'll be
  monitoring data continuously as
  it's added to a database and
  detecting faults in real time.
  So I'm going to start off with
  the obligatory slide on the
  advantages of multivariate
  control charts. So why not use
  univariate control charts? There
  are a number of excellent
  options in JMP. Univariate
  control charts are excellent
  tools for analyzing a few
  variables at a time. However,
  quality control data are often
  high dimensional and the number
  of control charts you need to
  look at can quickly become
  overwhelming. Multivariate
  control charts can summarize a
  high dimensional process in
  just a couple of control charts,
  so that's a key advantage.
  But that's not to say that
  univariate control charts aren't
  useful in this setting. You'll
  see throughout the talk that
  fault diagnosis often involves
  switching between multivariate
  and univariate charts.
  Multivariate control charts give
  you a sense of the overall
  health of the process, while
  univariate charts allow you to
  monitor specific aspects of the
  process. So the information is
  complementary. One of the goals
  of model driven multivariate control chart is to provide some
  useful tools for switching
  between these two types of
  charts. One disadvantage of
  univariate charts is that
  observations can appear to be in
  control when they're actually
  out of control in the multivariate
  sense and these plots show what I
  mean by this. The univariate control charts for oil and density show the two
  observations in red as in control. However, oil and density are highly
  correlated, and both observations are out of control in the multivariate
  sense, especially observation 51, which clearly violates the correlation
  structure of the two variables,
  so multivariate control charts
  can pick up on these types of
  outliers, while univariate
  control charts can't.
  Model driven multivariate
  control chart uses projection
  methods to construct the charts.
  I'm going to start by explaining PCA
  because it's easy to build up
  from there. PCA reduces the
  dimensionality of the process by
  projecting data onto a low
  dimensional surface. Um,
  this is shown in the picture
  on the right. We have P
  process variables and N
  observations, and
  the loading vectors in the P
  matrix give the coefficients for
  linear combinations of our X
  variables that result in
  score variables with
  dimension A, where the dimension
  A is much less than P. And then
  this is shown in equations on
  the left here. X can be predicted as a function of the scores and loadings, where E is
  the prediction error.
  These scores are selected to
  minimize the prediction error,
  and another way to think about
  this is that you're maximizing
  the amount of variance explained
  in the X matrix.
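  For reference, the decomposition on the slide is the standard PCA model. Writing it out (note the talk uses P both for the number of variables and for the loading matrix):

  $$ X = T P^{\top} + E, \qquad \hat{X} = T P^{\top}, $$

  where X is the N x P matrix of process data, T is the N x A matrix of score variables, P is the P x A matrix of loadings, and E is the matrix of prediction errors, with A much smaller than P. Choosing the scores to minimize the squared prediction error is the same as maximizing the variance of X explained by the A components.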
  Then PLS is a more suitable projection method when you have a set of process
  variables and a set of quality variables. You really want to ensure that the
  quality variables are kept in control, but these variables are often expensive
  or time consuming to collect. The plant could be making product with
  out-of-control quality for a long time before a fault is detected.
  So PLS models allow you to
  monitor your quality variables
  as a function of your process
  variables and you can see that
  the PLS models find the score
  variables that maximize the
  amount of variation explained in
  the quality variables.
  These process variables are
  often cheaper or more readily
  available, so PLS can enable you
  to detect faults in quality
  early and make your process
  monitoring cheaper. And from here
  on out I'm going to focus on PLS
  models because they're more
  appropriate for the example.
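  As a minimal sketch in the same notation, the PLS model adds a second block for the quality variables (the matrix Y and its loadings Q are my labels, not from the slide):

  $$ X = T P^{\top} + E, \qquad Y = T Q^{\top} + F, $$

  where Y holds the quality variables and F is their prediction error. The scores T are now chosen to explain variation in Y, more precisely to maximize covariance between the X and Y blocks, rather than variance in X alone.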
  So the PLS model partitions your
  data into two components. The
  first component is the model
  component. This gives the
  predicted values of your process
  variables. Another way to think
  about it is that your data has
  been projected into the model
  plane defined by your score
  variables and T squared monitors
  the variation of your data
  within this model plane.
  And the second component is the
  error component. This is the
  distance between your original
  data and the predicted data and
  squared prediction error (SPE)
  charts monitor this variation.
  Another alternative metric we
  provide is the distance to model
  X plane or DModX. This is just
  a normalized alternative to SPE
  that some people prefer.
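  As a sketch of the definitions, for observation i with score vector t_i and prediction errors e_ij (these are the standard forms; JMP's exact scaling may differ):

  $$ T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2}, \qquad \mathrm{SPE}_i = \sum_{j=1}^{P} e_{ij}^2 = \sum_{j=1}^{P} \left( x_{ij} - \hat{x}_{ij} \right)^2, $$

  where s_a^2 is the variance of the a-th score in the historical data. T squared measures variation within the model plane, SPE measures the distance from the plane, and DModX is essentially a normalized square root of SPE.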
  The last concept that's
  important to understand for the
  demo is the distinction between
  historical and current data.
  Historical data are typically
  collected when the process was
  known to be in control. These
  data are used to build the PLS
  model and define the normal
  process variation so that a
  control limit can be obtained.
  And current data are assigned
  scores based on the model but
  are independent of the model.
  Another way to think about this
  is that we have training and
  test sets. The T squared control
  limit is lower for the training
  data because we expect less variability for the observations used to train
  the model, whereas there's greater variability in T squared when the model
  generalizes to a test set. Fortunately, the theory for the distribution of
  T squared has been worked out, so we can get these control limits based on
  some distributional assumptions.
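  For reference, the commonly used limits for a model with A components estimated from N historical observations are the Tracy-Young-Mason forms below; JMP's implementation may differ in detail:

  $$ \mathrm{UCL}_{\text{historical}} = \frac{(N-1)^2}{N} \, B_{1-\alpha}\!\left( \frac{A}{2}, \frac{N-A-1}{2} \right), \qquad \mathrm{UCL}_{\text{current}} = \frac{A(N+1)(N-1)}{N(N-A)} \, F_{1-\alpha}(A, N-A), $$

  where B and F denote Beta and F quantiles. The second expression is larger, which is why the current data get a higher T squared limit than the historical data.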
  In the demo, we'll be monitoring
  the Tennessee Eastman process.
  I'm going to present a short
  introduction to these data. This
  is a simulation of a chemical
  process developed by Downs and
  Vogel, two chemists at Eastman
  Chemical. It was originally
  written in Fortran, but there
  are wrappers for Matlab and
  Python now. I just wanted to note
  that while this data set was
  generated in the '90s, it's still
  one of the primary data sets
  used to benchmark multivariate
  control methods in the
  literature. It covers the
  main tasks of multivariate
  control well and there is
  an impressive amount of
  realism in the simulation.
  And the simulation is based on
  an industrial process that's
  still relevant today.
  So the data were manipulated
  to protect proprietary
  information. The simulated
  process is the production of
  two liquid products from
  gaseous reactants within a
  chemical plant. And F here is
  a byproduct
  that will need to be siphoned
  off from the desired product.
  Um and...
  That's about all I'll say about that.
  So the process diagram looks
  complicated, but it really isn't
  that bad, so I'll walk you
  through it. Gaseous
  reactants A, D, and E flow into
  the reactor here.
  The reaction occurs and the
  product leaves as a gas. It's
  then cooled and condensed into
  liquid in the condenser.
  Then a vapor liquid separator
  recycles any remaining vapor and
  sends it back to the reactor
  through a compressor, and the
  byproduct and inert chemical B
  are purged in the purge stream,
  and that's to prevent any
  accumulation. The liquid product
  is pumped through a stripper,
  where the remaining reactants
  are stripped off.
  And then sent back to the reactor.
  And then finally, the
  purified liquid product
  exits the process.
  The first set of variables being
  monitored are the manipulated
  variables. These look like bow
  ties in the diagram. I think
  they're actually meant to be
  valves and the manipulated
  process...or the manipulated
  variables mostly control the
  flow rate through different
  streams of the process.
  And these variables can be set
  to any values within limits and
  have some Gaussian noise.
  The manipulated variables can be sampled at any rate, but we use the default
  3-minute sampling interval.
  Some examples of the manipulated
  variables are the valves that
  control the flow of reactants
  into the reactor.
  Another example is a valve
  that controls the flow of
  steam into the stripper.
  And another is a valve that
  controls the flow of coolant
  into the reactor.
  The next set of variables are
  measurement variables. These are
  shown as circles in the diagram.
  They were also sampled at three
  minute intervals. The
  difference between manipulated
  variables and measurement
  variables is that the
  measurement variables can't be
  manipulated in the simulation.
  Our quality variables will be
  the percent composition of
  two liquid products and you
  can see the analyzer
  measuring the products here.
  These variables are sampled with
  a considerable time delay, so
  we're looking at the purge
  stream instead of the exit
  stream, because these data are
  available earlier. And we'll use a PLS model to monitor process variables as
  a proxy for these variables, because the process variables have less delay
  and a faster sampling rate.
  So that should be enough
  background on the data. In
  total there are 33 process
  variables and two quality
  variables. The process of
  collecting the variables is
  simulated with a set of
  differential equations. And this
  is just a simulation, but as you
  can see a considerable amount of
  care went into modeling this
  after a real world process. Here
  is an overview of the demo I'm
  about to show you. We will collect
  data on our process and store
  these data in a database.
  I wanted to have an example that
  was easy to share, so I'll be
  using a SQLite database, but
  the workflow is relevant to most
  types of databases since most
  support ODBC connections.
  Once JMP forms an ODBC
  connection with the database,
  JMP can periodically check for
  new observations and add them to
  a data table.
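  As a rough JSL sketch of that step, assuming the SQLite3 ODBC driver and a made-up database path and table name (placeholders, not the actual files from the Community materials):

// Pull the current contents of the monitoring table into JMP over ODBC.
// The driver name, file path, and table name are placeholder assumptions.
dt = Open Database(
	"DRIVER=SQLite3 ODBC Driver;Database=C:\TEP\tep_monitor.db;",
	"SELECT * FROM monitoring_data",
	"Monitoring Data"
);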
  If we have a model driven
  multivariate control chart
  report open with automatic
  recalc turned on, we have a
  mechanism for updating the
  control charts as new data come
  in. The whole process of
  adding data to a database would
  likely be going on a separate
  computer from the computer
  that's doing the monitoring. So
  I have two sessions of JMP open
  to emulate this. Both sessions
  have their own journal
  in the materials on the
  Community, and the session
  adding new simulated data to
  the database will be called
  the Streaming Session, and the session updating the reports
  as new data come in will be
  called the Monitoring Session.
  One thing I really liked about
  the Downs and Vogel paper was
  that they didn't provide a
  single metric to evaluate the
  control of the process. I have
  a quote from the paper here
  "We felt that the tradeoffs
  among the possible control
  strategies and techniques
  involved much more than a
  mathematical expression."
  So here are some of the goals
  they listed in their paper,
  which are relevant to our
  problem. They wanted to maintain
  the process variables at
  desired values. They wanted to
  minimize variability of product
  quality during disturbances, and
  they wanted to recover quickly
  and smoothly from disturbances.
  So we'll see how well our
  process achieves these goals
  with our monitoring methods.
  So to start off in the
  Monitoring Session journal, I'll
  show you our first data set.
  The data table contained all of
  the variables I introduced
  earlier. The first variables are
  the measurement variables; the
  second are the composition variables.
  And the third are the
  manipulated variables.
  The script up here will fit
  a PLS model. It excludes the
  last 100 rows as a test set.
  Just as a reminder,
  the model is predicting 2
  product composition
  variables as a function of
  the process variables. If
  you have JMP Pro, there
  have been some speed
  improvements to PLS
  in JMP 16.
  PLS now has a
  fast SVD option.
  You can switch to the
  classic in the red
  triangle menu. There's
  also been a number of
  performance improvements
  under the hood.
  Mostly relevant for datasets
  with a large number of
  observations, but that's
  common in the multivariate
  process monitoring setting.
  But PLS is not the focus of the
  talk, so I've already fit the
  model and output score columns
  and you can see them here.
  One reason that model driven multivariate control chart was designed the way it is, is this:
  imagine you're a statistician
  and you want to share your model
  with an engineer so they can
  construct control charts. All
  you need to do is provide the
  data table with these formula
  columns. You don't need to share
  all the gory details of how you
  fit your model.
  Next, I'll provide the score
  columns to model driven multivariate control chart.
  Drag it to the right here.
  So on the left here you can see
  two types of control charts: the
  T squared and SPE.
  Um, there are 860 observations
  that were used to estimate the
  model and these are labeled as
  historical. And then the hundred
  that were left out as a test set
  are your current data.
  And you can see in the limit
  summaries, the number of points
  that are out of control and the
  significance level. Um, if you
  want to change the significance
  level, you can do it up here in
  the red triangle menu.
  Because the reactor's in normal
  operating conditions, we expect
  no observations to be out of
  control, but we have a few false
  positives here because we
  haven't made any adjustments for
  multiple comparisons. It's
  uncommon to do this, as far as I
  can tell, in multivariate
  control charts. I suppose you
  have higher power to detect out
  of control signals without a
  correction. In control chart
  lingo, this means your out-of-control average run length is kept low.
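  For a Shewhart-type chart with independent points, the rough relations are:

  $$ \mathrm{ARL}_0 = \frac{1}{\alpha}, \qquad \mathrm{ARL}_1 = \frac{1}{1-\beta}, $$

  where ARL_0 is the in-control average run length (time between false alarms), ARL_1 is the out-of-control average run length (time to detect a real shift), and beta is the type II error rate. Skipping a multiplicity correction keeps the per-point alpha higher, so you trade more false alarms for quicker detection.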
  So on the right here we
  also have contribution
  plots and on the Y axis are
  the observations; on the X
  axis, the variables. A
  contribution is expressed
  as a proportion.
  And then at the bottom here,
  we have score plots. Right
  now I'm plotting the first
  score dimension versus the
  second score dimension, but
  you can look at any
  combination of score
  dimensions using these
  dropdown menus or the arrow
  button.
  OK, so I think we're oriented
  to the report. I'm going to
  now switch over to the
  scripts I've used to stream
  data into the database that
  the report is monitoring.
  In order to do anything for this
  example, you'll need to have a
  SQLite ODBC driver installed
  on your computer. This is much easier
  to do on a Windows computer,
  which is what you're often using
  when actually connecting to a
  database. The process on the Mac
  is more involved, but I put some
  instructions on the Community
  page. And then I don't have time
  to talk about this, but I
  created the SQLite database
  I'll be using in JMP and I
  plan to put some instructions
  on how to do this on the
  Community Web page. And hopefully
  that example is helpful to you
  if you're trying to do this with
  data on your own.
  Next I'm going to show
  you the files that I put
  in the SQLite database.
  Here I have the historical data.
  This was used to construct
  the PLS model. There are 960
  observations that are in
  control. Then I have the
  monitoring data, which at first
  just contains the historical
  data, but I'll gradually add new
  data to this. This is the data
  that the multivariate control
  chart will be monitoring.
  And then I've simulated new
  data already and added it to the
  data table here. These are
  another 960 odd measurements
  where a fault is introduced at
  some time point. I wanted to
  have something that was easy to
  share, so I'm not going to run
  my simulation script and add to
  the database that way. We're
  just going to take observations
  from this new data table and
  move them over to the monitoring
  data table using some JSL and
  SQL statements. This is just an
  example emulating the process
  of new data coming into a
  database. In practice, you might not actually do this with JMP, but
  this was an opportunity to show
  how you can do it with JSL.
  Clean up here.
  And next I'll show you this
  streaming script. This is a
  simple script, so I'm going to
  walk you through it real quick.
  This first set of
  commands will open the
  new data table, which is in the SQLite database. It opens the table in the
  background so I don't have to
  deal with the window.
  Then I'm going to take pieces
  from this data table and add
  them to the monitoring data
  table. I call the pieces
  bites and the bite size is 20.
  And then this next command will
  connect to the database. This
  will allow me to send the
  database SQL statements.
  And then this next bit
  of code is
  iteratively sending SQL
  statements that insert new
  data into the monitoring data.
  And I'm going to
  initialize K and show you the
  first iteration of this.
  This is a simple SQL INSERT INTO statement that
  inserts the first 20
  observations into the data
  table. This print statement is
  commented out so that the code
  runs faster and then I also
  have a wait statement to slow
  things down slightly so that
  we can see their progression
  in the control chart.
  And this would just go too fast
  if I didn't slow it down.
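  Here is a minimal JSL sketch of that kind of streaming loop. The connection string, table names, and bite size are placeholder assumptions, and it copies rows inside the database with INSERT ... SELECT rather than building statements from the JMP table, so treat it as an illustration rather than the actual script from the Community materials:

// Open the pre-simulated fault data from the SQLite database, only to count
// how many rows there are to stream.
dt_new = Open Database(
	"DRIVER=SQLite3 ODBC Driver;Database=C:\TEP\tep_monitor.db;",
	"SELECT * FROM new_data",
	"New Data"
);
bite_size = 20;                                   // rows sent per iteration
n_bites = Floor( N Rows( dt_new ) / bite_size );
Close( dt_new, NoSave );

// Connect so we can send the database SQL statements directly.
dbc = Create Database Connection( "DRIVER=SQLite3 ODBC Driver;Database=C:\TEP\tep_monitor.db;" );

For( k = 0, k < n_bites, k++,
	// Copy the next bite of rows into the table the report is monitoring.
	Execute SQL( dbc,
		Eval Insert(
			"INSERT INTO monitoring_data SELECT * FROM new_data " ||
			"WHERE rowid > ^k * bite_size^ AND rowid <= ^(k + 1) * bite_size^;"
		)
	);
	Wait( 1 );   // slow things down so the progression is visible in the chart
);

Close Database Connection( dbc );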
  Um, so next I'm going to move
  over to the monitoring session
  to show you the scripts
  that will update the report
  as new data come in.
  This first script is a simple script that will check the database every 0.2 seconds for
  new observations and add them
  to the JMP table. Since the
  report has automatic recalc
  turned on, the report will update
  whenever new data are added. And
  I should add that
  realistically,
  you probably wouldn't use a
  script that just iterates like
  this. You'd probably use Task Scheduler on Windows or
  Automator on Mac to better
  schedule runs of the script.
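  A sketch of that polling loop in JSL might look like the following; the names are again placeholders, the loop is bounded instead of running forever, and error handling is left out:

// Append any rows the database has that the local table doesn't. With
// automatic recalc on, the control chart report refreshes as rows arrive.
dt = Data Table( "Monitoring Data" );
dbc = Create Database Connection( "DRIVER=SQLite3 ODBC Driver;Database=C:\TEP\tep_monitor.db;" );

For( i = 1, i <= 1000, i++,
	n_local = N Rows( dt );
	dt_newrows = Execute SQL( dbc,
		Eval Insert( "SELECT * FROM monitoring_data WHERE rowid > ^n_local^;" ),
		"new rows"
	);
	If( !Is Empty( dt_newrows ) & N Rows( dt_newrows ) > 0,
		dt << Concatenate( dt_newrows, Append to first table );  // add the new rows in place
		Close( dt_newrows, NoSave )
	);
	Wait( 0.2 );   // check every 0.2 seconds, as in the talk
);

Close Database Connection( dbc );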
  And then there's also another
  script that will
  push the report to JMP Public
  whenever the report is updated,
  and I was really excited that
  this is possible with JMP 15.
  It enables any computer with a
  web browser to view updates to
  the control chart. Then you
  can even view the report on
  your smartphone, so this makes
  it really easy to share
  results across organizations.
  And you can also use JMP Live
  if you wanted the reports to
  be on a restricted server.
  I'm not going to have time
  to go into this in this
  demo, but you can check out
  my Discovery Americas talk.
  Then finally down here, there is
  a script that recreates the
  historical data in the data
  table if you want to run the
  example multiple times.
  Alright, so next...make sure
  that we have the historical data...
  I'm going to run the
  streaming script and see
  how the report updates.
  So the data is in control at
  first and then a fault is
  introduced, but there's a
  plantwide control system
  that's implemented in the
  simulation, and you can see
  how the control system
  eventually brings the process
  to a new equilibrium.
  Wait for it to finish here.
  So if we zoom in,
  it seems like the process first
  went out of control around this
  time point, so I'm going to
  color it and
  label it, so that it will
  show up in other plots.
  And then in the SPE plot,
  it looks like this
  observation is also out of
  control but only slightly.
  And then if we zoom in on
  the time point in the
  contribution plots, you can
  see that there are many
  variables contributing to
  the out of control signal at
  first. But then once the
  process reaches a new
  equilibrium, there's only
  two large contributors.
  So I'm going to remove the heat
  maps now to clean up a bit.
  You can hover over
  the point at which the process
  first went out of control and
  get a peek at the top ten
  contributing variables. This
  is great for giving you a
  quick overview of which variables
  are contributing most to the
  out of control signal.
  And then if I click on the plot,
  this will be appended to the
  fault diagnosis section.
  And as you can see, there are several variables with large contributions;
  I've just sorted them by contribution.
  And for variables with
  red bars the observation is
  out of control in the univariate
  control charts. You can see
  this by hovering over one of
  the bars and these graphlets
  are IR charts for an
  individual variable with a
  three Sigma control limit.
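  For reference, those graphlet limits are the usual individuals-chart construction, with sigma estimated from the average moving range (JMP's defaults should be close to this):

  $$ \hat{\sigma} = \frac{\overline{MR}}{1.128}, \qquad \bar{x} \pm 3\hat{\sigma}, $$

  where MR-bar is the average of the moving ranges |x_i - x_{i-1}| of consecutive points and 1.128 is the d_2 constant for ranges of two observations.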
  You can see in the stripper
  pressure variable that the
  observation is out of
  control, but eventually the
  process is brought back under
  control. And this is the case
  for the other top
  contributors. I'll also show you the univariate control chart for one of the
  variables that's in control.
  So the process was...
  there are many variables out
  of control in the process at
  the beginning, but the
  process eventually reaches
  a new equilibrium.
  Um...
  To see the variables that
  contribute most to the shift in
  the process, we can use the mean contribution proportion plots.
  These plots show the average
  contribution that the variables
  have to T squared for the group
  I've selected. Um, here if I
  sort on these.
  The only two variables with
  large contributions measure the
  rate of flow of reactant A in
  stream one, which is the flow of
  this reactant into the reactor.
  Both of these variables are
  measuring essentially the
  same thing, except one
  measurement...one is a
  measurement variable and the
  other is a manipulated
  variable.
  You can see that there is a
  large step change in the flow
  rate, which is what I programmed
  in the simulation. So these
  contribution plots allow you to
  quickly identify the root cause.
  And then in my previous talk I
  showed many other ways to
  visualize and diagnose faults
  using tools in the score plot.
  This includes plotting the
  loadings on the score plots and
  doing some group comparisons.
  You can check out my Discovery
  Americas talk on the JMP
  Community for that. Instead, I'm
  going to spend the rest of this
  time introducing a few new
  examples, which I put on the
  Community page for this talk.
  So.
  There are 20 programmable faults
  in the Tennessee Eastman process
  and they can be introduced in any
  combination. I provided two other
  representative faults here. Fault
  1 that I showed previously was
  easy to detect because the out
  of control signal is so large
  and so many variables are
  involved. The focus of the previous demo was to show how to use the tools to
  identify faults out of a large number of variables, not necessarily to
  benchmark the methods.
  Fault 4, on the other hand,
  is a more subtle fault,
  and I'll show you it here.
  The fault that's programmed
  is a sudden increase in the
  temperature in the reactor.
  And this is compensated for by
  the control system by increasing
  the flow rate of coolant.
  And you can see that
  variable picked up here, and you can see the shift in the contribution plots.
  And then you can also see
  that most other variables
  aren't affected
  by the fault. You can see a
  spike in the temperature here that
  is quickly brought back under
  control. Because most other
  variables aren't affected, this
  is hard to detect for some
  multivariate control methods.
  And it can be more
  difficult to diagnose.
  The last fault I'll show you
  is Fault 11.
  Like Fault 4, it also involves
  the flow of coolant into the
  reactor, except now the fault
  introduces large oscillations in
  the flow rate, which we can
  see in the univariate control
  chart. And this results in a
  fluctuation of reactor
  temperature. The other
  variables aren't really
  affected again, so this can be
  harder to detect for some
  methods. Some multivariate
  control methods can pick up on
  Fault 4, but not Fault 11 or
  vice versa. But our method was
  able to pick up on both.
  And then finally, all the
  examples I created using the
  Tennessee Eastman process had
  faults that were apparent in
  both T squared and SPE plots. To
  show some newer features in
  model driven multivariate
  control chart, I wanted to show
  an example of a fault that
  appears in the SPE chart but not
  T squared. And to find a good
  example of this, I revisited a
  data set which Jianfeng Ding
  presented in her former talk, and
  I provided a link to her talk
  in this journal.
  On her Community page,
  she provides several
  useful examples that are
  also worth checking out.
  This is a data set from MacGregor's classic paper on
  multivariate control charts. The
  data are process variables
  measured in a reactor, producing
  polyethylene, and you can find
  more background in Jianfeng's
  talk. In this example, we
  have a process that went out of
  control. Let me show you this.
  And it goes out of control earlier in the SPE chart than in the T squared chart.
  And if we look at the mean
  contribution
  plots for SPE,
  you can
  see that there is one variable
  with large contribution and it
  also shows a large shift in the
  univariate control chart. But there are also other variables with large
  contributions that are still in control in the univariate control charts.
  And it's difficult to determine from
  the bar charts alone why these
  variables had such large
  contributions. Large SPE values
  happen when new data don't
  follow the correlation structure
  of the historical data, which can be the case when new data are collected,
  and this means that the PLS model you trained is no longer applicable.
  From the bar charts, it's hard
  to know which pair of variables
  have their correlation structure
  broken. So new in 15.2, you
  can launch scatterplot matrices.
  And it's clear in the
  scatterplot matrix that the
  violation of correlations
  with Z2 is what's driving
  these large contributions.
  OK, I'm gonna switch back
  to the PowerPoint.
  And real quick, I'll summarize
  the key features of model driven
  multivariate control chart that
  were shown in the demo. The
  platform is capable of
  performing both online fault
  detection and offline fault
  diagnosis. There are many
  methods provided in the platform
  for drilling down to the root
  cause of faults. I'm showing you
  here some plots from a popular
  book, Fault Detection and
  Diagnosis in Industrial Systems.
  Throughout the book, the authors
  demonstrate how one needs to
  use multivariate and univariate
  control charts side by side
  to get a sense of what's going
  on in a process.
  And one particularly useful
  feature in model driven multivariate
  control chart is how
  interactive and user friendly
  it is to switch between these
  two types of charts.
  And that's my talk. Here is
  my email if you have any
  further questions. And
  thanks to everyone that
  tuned in to watch this.