shampton82
Level VII

"Get rows where" is taking way to long

I have a reference table where each column name matches a column in a main data table, and the row values are the part numbers that have bad data for that column in the main table.  For example, in the main table, column 1 is the part number and column 2 is "Temp".  Let's say the row number equals the part number, so row 1 is part "1", and say it has a value of 65 in the "Temp" column.  Part 2 has a temp of -100 and thus is bad data.  So in the reference table, under the column named "Temp", the value in row 1 is "2".  This is repeated for all the part numbers that have bad data in the main table, and then for each column as well.

 

What this gives me is a bad-data reference table I can then loop through to automatically remove the bad data each time I pull in the newest data for the main table.  I have code that works; however, the line:

(where vlist contains all the part numbers from the reference table and dt2 is the main data table)

Try( rowindex = dt2 << Get Rows Where( Contains( vlist, :WORKORDERNO ) ) );

is super slow: it takes 22 seconds to complete.  As I have 196 columns to go through, this takes way too long.  Any suggestions on how to speed this up would be appreciated!

 

 

The full code is here:

Names Default To Here( 1 );

dt = Open(
	"J:\Public\Ti Process Control\Dashboard Files\JMP data for CC's\Limits\master_data_table_clean_up.jmp",
	invisible( 1 )
); //this is the reference table
dt << begin data update;


cols = dt << get column names();

dt2 = Data Table( "Master data pull daily download" );  //this is the main table
dt2 << begin data update;
dt2 << Clear Select();

cols2 = dt2 << get column names();

vlist = {};
For( i = 1, i <= 2 /*N Items( cols )*/, i++,
	Try( vlist = Column( dt, cols[i] ) << Get Values );
	starttime = Tick Seconds();
	Try( rowindex = dt2 << Get Rows Where( Contains( vlist, :WORKORDERNO ) ) );
	endtime = Tick Seconds();
	Print( "run time", Eval Insert( "Analysis took ^Round( endtime - starttime )^ seconds" ) );
	Try( Column( dt2, cols[i] )[rowindex] = . );
);

dt2 << end data update;

dt << end data update;

Close( dt, nosave );
1 ACCEPTED SOLUTION
txnelson
Super User

Re: "Get rows where" is taking way to long

I suspect all of the blanks are causing your code to slow down.  Here is a workaround: just replace your creation of vlist with

Try( vlist = Associative Array( As Column( cols[i] ) ) << get keys );
If( vlist[1] == "",
	Remove From( vlist, 1, 1 )
);
Jim


11 Replies
ErraticAttack
Level VI

Re: "Get rows where" is taking way to long

Consider how JMP does name resolution.  When you run a Get Rows Where() function, it goes row by row over the table, evaluating the where expression.  Each evaluation requires name resolution for every variable referenced, and since your vlist variable is defined outside of the running loop, JMP has to do some scope-walking on each iteration, leading to a lot of wasted work.

 

To get around this, consider pre-evaluating any names that will be constant throughout the lookup, like this:

Names Default To Here( 1 );

dt = Open(
	"J:\Public\Ti Process Control\Dashboard Files\JMP data for CC's\Limits\master_data_table_clean_up.jmp",
	invisible( 1 )
); //this is the reference table
dt << begin data update;


cols = dt << get column names();

dt2 = Data Table( "Master data pull daily download" );  //this is the main table
dt2 << begin data update;
dt2 << Clear Select();

cols2 = dt2 << get column names();

vlist = {};
For( i = 1, i <= 2 /*N items(cols)*/, i++,
	Try( vlist = Column( dt, cols[i] ) << get values );
	starttime = Tick Seconds();
	Eval( Eval Expr(
	Try( rowindex = dt2 << Get Rows Where( Contains( Expr( vlist ), :WORKORDERNO ) ) );
	) );
	endtime = Tick Seconds();
	Print( "run time", Eval Insert( "Analysis took ^round(endtime - starttime)^ seconds" ) );
	Try( Column( dt2, cols[i] )[rowindex] = . );
);

dt2 << end data update;

dt << end data update;

Close( dt, nosave );

Notice the Eval( Eval Expr( ... Expr( vlist ) ... ) ) wrapper -- it pre-evaluates vlist, since vlist is constant throughout the loop.

Jordan
bculver
Level II

Re: "Get rows where" is taking way to long

I believe you can use As Constant( vlist ) to achieve the same effect; the name 'vlist' should then be evaluated only once for each Get Rows Where call.
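
As a minimal sketch, reusing the dt2, vlist, and :WORKORDERNO names from the original script, that would look like:

// Sketch only: As Constant() tells JSL to evaluate vlist once up front,
// instead of re-resolving the name for every row Get Rows Where() visits.
Try( rowindex = dt2 << Get Rows Where( Contains( As Constant( vlist ), :WORKORDERNO ) ) );

jthi's benchmark scripts further down this thread use As Constant() in the same way.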

shampton82
Level VII

Re: "Get rows where" is taking way to long

Thanks for the reply @ErraticAttack , I popped your code modifications in and the code executed in 21 seconds vs 22 seconds.  So a little better, but not where I need to be.  Any other ideas would be appreciated!

bculver
Level II

Re: "Get rows where" is taking way to long

I also find Loc() to be faster in most situations like this, if you can find a way to use it.  There may be a better approach, but you can try something like the following to replace the Get Rows Where call.

dt = New Table( "BadTable",
	Add Rows( 100 ),
	New Column( "BadPartNumbers",
		Formula( Random Integer( 1000 ) ),
	),
);
dt2 = New Table( "FullTable",
	Add Rows( 10000 ),
	New Column( "NewPartNumbers",
		Formula( Random Integer( 1000 ) ),
	),
);

newPNumbers = dt2:NewPartNumbers << Get Values;
badPNumbers = Associative Array( dt:BadPartNumbers << Get Values );

rowIndex = Loc( Matrix( badPNumbers << Get Values( As List( newPNumbers ) ) ) );

I'm not exactly sure how your tables are set up, but basically it turns the bad part numbers into an associative array in which bad part numbers return 1 and anything else returns 0.  When you pass a list of all part numbers to Get Values, it returns a 1 or 0 for each row.  Calling Loc on that returns the indices where 1 is found, which should correspond to the rows with bad part numbers.

 

shampton82
Level VII

Re: "Get rows where" is taking way to long

@ErraticAttack So I made a test master data table and reference table to share so people could play along, but when I did, the script ran really fast, in under a second.  So now I'm wondering if it has something to do with my actual data table?

 

Here's the test tables

 

shampton82_0-1697984891343.png

 

jthi
Super User

Re: "Get rows where" is taking way to long

Depending on your data, there are many different methods you could use to perform these checks (joins/updates, Loc, Get Rows Where, associative array, summary table...).  How many rows do your tables have?  How many different values, at most, are in your reference table columns?  Are most of the values empty (remove empty values before performing the comparison to make the list smaller, or include the empty value just once)?

-Jarmo
shampton82
Level VII

Re: "Get rows where" is taking way to long

Hey @jthi ,

Here is an overview of the main table:

shampton82_7-1697998161363.png

 

 

and here's the ref table:

shampton82_4-1697997765463.png

Here's how much data each column in the ref table has (from the Missing Data Screener output):


Number Missing
0
0
0
0
3115
6075
12237
16591
16591
16591
16591
16772
17038
17093
17095
17095
17099
17099
17109
17110
17113
21542
22781
23634
24792
24792
25058
25172
25280
25341
25356
25376
25376
25827
25863
25875
25893
25894
25897
25905
25942
25957
25963
25963
25963
25965
25968
25968
25968
25968
26367
26564
27190
27404
27414
27420
27437
27453
27465
27474
27478
27493
27503
27522
27547
27553
27573
27592
27597
27597
27597
27597
27619
27626
27628
27628
27629
27629
27629
27629

 

So most of the columns are full of "missing data"/blanks, due to the few columns that have a ton of bad data.

 

Makes me wonder if having all these blanks in the vlist is causing the slowdown:

shampton82_6-1697998108875.png

 

jthi
Super User

Re: "Get rows where" is taking way to long

Most likely it will cause a slowdown, especially if the value isn't found in the list:

Names Default To Here(1);

test = {"a", "b", "c", "d", "e", "f"};
test2 = test || Repeat({""}, 27629);
lookup_val = "a";
repeats = 200;

tests = {};
For(i = 1, i <= repeats, i++,
	start = HP Time();
	Contains(test, lookup_val);
	wait(0);
	end = HP Time();
	Insert Into(tests, (end-start)/1e6);	
);

tests2 = {};
For(i = 1, i <= repeats, i++,
	start = HP Time();
	Contains(test2, lookup_val);
	wait(0);
	end = HP Time();
	Insert Into(tests2, (end-start)/1e6);
);

Show(Mean(tests), Std Dev(tests), Mean(tests2), Std Dev(tests2));
Show(Median(tests2)/Median(tests));

 

Edit:

Both of these test scripts were written fairly quickly, so there may be some mistakes.

Names Default To Here(1);

Random Reset(1);

dt_data = New Table("Data",
	Add Rows(30000),
	New Column("first", Numeric, Nominal, << Set Each Value(Row())),
	New Column("second", Numeric, Nominal, << Set Each Value(Row())),
	invisible
);
Column(dt_data, "first") << Set Data Type("Character");
Column(dt_data, "second") << Set Data Type("Character");


dt_lookup = New Table("Lookup",
	Add Rows(5000),
	New Column("first", Numeric, Nominal, Values(Random Shuffle(1::1000)[1::10])),
	New Column("second", Numeric, Nominal, Values(Random Shuffle(1::1000)[1::500])),
	invisible
);
Column(dt_lookup, "first") << Set Data Type("Character");
Column(dt_lookup, "second") << Set Data Type("Character");

start = HP Time();
dt_data << Get Rows Where(Contains(As Constant(dt_lookup[0, "first"]), :first));
dt_data << Get Rows Where(Contains(As Constant(dt_lookup[0, "second"]), :second));
end = HP Time();
show((end - start)/1e6);

wait(0);

start = HP Time();
dt_data << Get Rows Where(Contains(As Constant(Associative Array(dt_lookup[0, "first"]) << get keys), :first));
dt_data << Get Rows Where(Contains(As Constant(Associative Array(dt_lookup[0, "second"]) << get keys), :second));
end = HP Time();
show((end - start) / 1e6);

Close(dt_data, no save);
Close(dt_lookup, no save);

 

Depending on how many values you have and whether they are characters (I think in your case they are), using Summarize() instead of Associative Array() to get the unique values will be faster, but most likely this won't have that big an effect in this case:

Names Default To Here(1);

dt_lookup = New Table("Lookup",
	Add Rows(5000),
	New Column("test", Numeric, Nominal, Values(Random Shuffle(1::1000)[1::500])),
	invisible
);
Column(dt_lookup, "test") << Set Data Type("Character");

start = HP Time();
Summarize(dt_lookup, uniq_vals = by(Column(dt_lookup, "test")));
end = HP Time();
show((end - start) / 1e6);

start = HP Time();
uniq_vals2 = Associative Array(Column(dt_lookup, "test")) << get keys;
end = HP Time();
show((end - start) / 1e6);

Show(N Items(uniq_vals), N Items(uniq_vals2));
Close(dt_lookup, no save);
-Jarmo
jthi
Super User

Re: "Get rows where" is taking way to long

This most likely falls into the category of premature optimization, but JMP's Update can sometimes be very fast, and it could be used here with some column renaming and by adding a new column full of empty values to your lookup table:

Names Default To Here(1);

dt_data = New Table("Data",
	Add Rows(30000),
	New Column("first", Numeric, Nominal, << Set Each Value(Row())),
	New Column("second", Numeric, Nominal, << Set Each Value(Row())),
	invisible
);
Column(dt_data, "first") << Set Data Type("Character");
Column(dt_data, "second") << Set Data Type("Character");


dt_lookup = New Table("Lookup",
	Add Rows(5000),
	New Column("first", Numeric, Nominal, Values(Random Shuffle(1::30000)[1::500])),
	invisible
);
Column(dt_lookup, "first") << Set Data Type("Character");

/*
start = HP Time();
Summarize(dt_lookup, uniq_vals = by(:first));
r = dt_data << Get Rows Where(Contains(As Constant(uniq_vals), :first));
dt_data[r, "second"] = "";
end = HP Time();
Show((end-start)/1e6);
*/

missing_col = dt_lookup << New Column("missing_vals", Numeric, Nominal);

start = HP Time();
i = 1;
Column(dt_lookup, "first") << Set Name(Char(i));
missing_col << Set Name("second");
dt_data << Update(With(dt_lookup), Match Columns(:first = Eval(Char(i))), Replace Columns in Main Table(:second));
end = HP Time();
Show((end-start)/1e6);

Close(dt_data, no save);
Close(dt_lookup, no save);
-Jarmo