SAS/VA Optimization Techniques

Overview

This post discusses a number of optimization techniques that, if implemented, can dramatically increase the scalability and performance of an existing SAS/VA reporting infrastructure with no changes to the physical or operating system layers.

Trimming Columns

Many of the existing tables in Oracle and their counterparts in HDFS and LASR have far more columns than are needed by VA reports.  In some cases, over 80% of the columns are not needed.  The LASR architecture is such that even if a VA report based on a LASR table requires only 40 of its columns, the entire LASR table is still loaded into memory.  Trimming columns reduces LASR table size in a near-linear fashion.  For example, a 100GB LASR table with 100 columns trimmed to 20 columns will likely result in a 20GB LASR table.  For sufficiently large tables with roughly similar data density, ETL times will be reduced by roughly the same percentage as the columns trimmed.

Importantly, both the ETL process and the end-user report-loading experience benefit, as the VA infrastructure is no longer burdened with loading 80GB of unneeded data.

Ideally this trimming would happen on the Oracle side, but given the challenge of identifying every column in use by all VA reports, it may be more expedient to trim first in HDFS and LASR, particularly if the underlying Oracle tables are in use by other non-SAS/VA tools.
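
As a minimal sketch of trimming at the HDFS/LASR layer, a KEEP= data set option in the load step carries forward only the columns the reports actually use.  The library and column names below are hypothetical:

/* Hypothetical libraries: ORALIB is a SAS/ACCESS to Oracle library,   */
/* HDFSLIB a SASHDAT library registered for VA.  Only the columns      */
/* needed by the reports are written to HDFS.                          */
data hdfslib.transactions;
   set oralib.transactions (keep=txn_id txn_date school_id amount status);
run;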

Of note, SAS/VA 7.3 includes a “Relationships Report” that provides “lineage and dependency information” for VA objects.  Having this in place will allow report developers to identify and drop unneeded columns.

Shortening Tables

Similar to trimming columns, dropping unneeded rows from the data stream yields improvements in ETL, storage, memory and performance.  While it may not be practical in all cases to drop entire years' worth of data, there may be specific instances in which data can be segmented (e.g. at the school level) to improve LASR loading and report viewing performance.

Pursuing this strategy is not without its own drawbacks, however; the improvement in performance would need to be balanced against the effort of maintaining multiple LASR tables (which may or may not be onerous).

As with column trimming, this recommendation could be implemented at the database and/or LASR layer.
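
As a sketch, assuming the data can be segmented at the school level, a WHERE= data set option (or the equivalent filter in a DI Studio job) restricts the rows that are loaded.  The table, column names and values are hypothetical:

/* Hypothetical segmentation: load only one school's recent rows into   */
/* a smaller, school-specific LASR table.                               */
data hdfslib.transactions_school_0042;
   set oralib.transactions (keep=txn_id txn_date school_id amount status
                            where=(school_id = 42 and txn_date >= '01JAN2015'd));
run;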

Create indexes for all WHERE and JOIN columns

In cases in which the data being loaded to LASR requires SQL JOIN and/or WHERE clauses, the columns used in those clauses should each have their own index.  Testing on financial transaction data that is joined and filtered by supporting tables has shown a 5x to 10x reduction in job time when appropriate indexes are created.

These indexes can exist in Oracle (and will be respected by SAS DI Studio if they are present) but can also be created dynamically in DI Studio on SAS Work tables.
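
A minimal sketch of creating indexes on Work tables ahead of a join; the table and column names are hypothetical, and note that a simple index must carry the same name as its column:

proc sql;
   /* Hypothetical staging tables produced by an upstream transform */
   create index school_id on work.transactions (school_id);
   create index school_id on work.schools (school_id);

   /* The join and filter can now make use of the indexes */
   create table work.txn_enriched as
   select t.*, s.school_name
   from work.transactions t
        inner join work.schools s
           on t.school_id = s.school_id
   where t.txn_date >= '01JAN2016'd;
quit;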

Set column order by variable type

Further optimization of the remaining columns can be achieved by controlling the order in which the columns themselves are defined in each table.  Best practice is to list numeric columns first, followed by character columns:

“In a join context, all noncharacter SAS columns consume 8 bytes. Ordering noncharacter columns first perfectly aligns internal SAS SQL buffers so that the character columns, which are listed last in the row returned by the execution of the SELECT statement, are not padded. The result is fewer I/Os and lower CPU consumption, particularly in the sort phase of a sort-merge join. 
In SAS® Data Integration Studio, use a transform’s Mapping tab to rearrange columns by column type. Do this by sorting the “type” column in the target table in descending order.”
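
Outside of DI Studio, the same reordering can be sketched in code with a RETAIN statement placed before the SET statement, which fixes the column order in the output table.  The column names are hypothetical:

/* Numeric columns first, character columns last (names hypothetical). */
/* RETAIN before SET only establishes column order; values still come  */
/* from the SET statement.                                             */
data work.transactions_ordered;
   retain txn_id school_id amount txn_date     /* numeric   */
          status school_name;                  /* character */
   set work.transactions;
run;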

Set appropriate data length for each variable

Similar to trimming columns, setting the appropriate data length for each variable will save I/O and CPU.  Note that when setting these data lengths, we are simply choosing not to store empty data space, which, while “empty,” still takes up physical space on disk and in memory.
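
As a sketch, a LENGTH statement placed before the SET statement (or the column properties in a DI Studio transform) right-sizes character columns; the names and lengths below are hypothetical:

/* Hypothetical right-sizing: a 2-character state code does not need    */
/* the $200 default it may inherit from an upstream extract.  SAS will  */
/* warn when an incoming length is shortened.                           */
data work.transactions_sized;
   length state_code $2 status $10 school_name $60;
   set work.transactions;
run;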

Tune HDFS block size

Setting the HDFS block size can have a significant impact on ETL times as well as on the resources needed by the HDFS name node, which has to track the location of every block across HDFS.  The default for most commercial Hadoop installations is 128MB, but SAS ships with a default of 2MB.  Given that many of the poorly performing HDFS tables are greater than 100GB, testing with larger block sizes could be a valuable area of investigation.
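
As a sketch, and assuming the SASHDAT engine's BLOCKSIZE= data set option is available in your release, the block size can be raised when the table is written to HDFS.  The connection values and the 32MB figure are placeholders to be tested, not recommendations:

/* Placeholder SASHDAT library pointing at the VA co-located HDFS.      */
libname hdfslib sashdat path="/vapublic"
        server="grid.example.com" install="/opt/TKGrid";

/* Write the table with a larger block size; test several values.       */
data hdfslib.transactions (blocksize=32M);
   set work.transactions_sized;
run;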

Compress HDFS and LASR tables

Both Data Integration Studio and the Visual Data Query tools provide mechanisms to compress data in HDFS and LASR.  Compression comes with its own set of pros and cons.  On the pro side, testing has shown a 20x to 150x reduction in storage needs.  On the con side, there is an increase in the CPU needed to load reports.  That said, if the other techniques in this post are implemented so that the pre-compressed version of a table contains only the required columns, it has been demonstrated that the time to load a VA report based on a compressed LASR table can actually be less than that of the uncompressed, untrimmed LASR table.
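
As a heavily hedged sketch, and assuming your release supports the COMPRESS= data set option on the SASHDAT and SASIOLA (LASR) engines (check the engine documentation, or use the compress setting in the DI Studio / Visual Data Builder interfaces instead), compression could be requested at load time.  Library names are placeholders:

/* Assumed COMPRESS= support; HDFSLIB is a SASHDAT library and          */
/* LASRLIB a SASIOLA (LASR) library in this sketch.                     */
data hdfslib.transactions (compress=yes blocksize=32M);
   set work.transactions_sized;
run;

data lasrlib.transactions (compress=yes);
   set hdfslib.transactions;
run;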

Tune buffer size on database library

The default READBUFF size for reads from Oracle libraries is 250, which is likely less than ideal.  Testing has shown that a value of 5000 to 10000 is more appropriate for our table sizes.  A good write-up of this option by Caroline Bahler exists here.
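
A minimal sketch of setting READBUFF= on the Oracle library; the connection values are placeholders and the password is assumed to come from a macro variable:

/* Placeholder connection values.  READBUFF= raises the number of rows  */
/* fetched per round trip from the default of 250.                      */
libname oralib oracle path=orcl schema=dw
        user=etl_user password="&orapw"
        readbuff=5000;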

Set threaded DBMS = ALL and/or SQL pass-through = YES

There is a very thorough write-up at lexjansen.com.  In short, multi-threaded reads can outperform SQL pass-through processing in some circumstances.  Controlled testing in our environment against specific jobs, in conjunction with DBAs monitoring load on the Oracle servers, would be valuable and would likely result in additional tuning improvements.
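
As a sketch, threaded reads map to the DBSLICEPARM= option and SQL pass-through to PROC SQL's CONNECT statement.  The connection values, schema, and table names are placeholders:

/* Threaded reads: DBSLICEPARM=ALL allows SAS to open multiple          */
/* connections and read partitions of the result in parallel.           */
libname oralib oracle path=orcl schema=dw
        user=etl_user password="&orapw"
        readbuff=5000 dbsliceparm=all;

/* Explicit SQL pass-through: the join and filter execute inside        */
/* Oracle and only the result set is returned to SAS.                   */
proc sql;
   connect to oracle (path=orcl user=etl_user password="&orapw");
   create table work.txn_enriched as
      select * from connection to oracle
      (select t.txn_id, t.amount, t.txn_date, s.school_name
         from dw.transactions t
              join dw.schools s on t.school_id = s.school_id
        where t.txn_date >= date '2016-01-01');
   disconnect from oracle;
quit;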

Minimize in-memory calculations

While in-memory calculations are a main feature of SAS/VA, they do incur a performance penalty, particularly if you are also compressing the underlying LASR table as recommended here.  If your reports are relatively static in nature (that is, the data may change but the report definitions themselves do not), then your reports will benefit from referencing fields that are pre-defined in LASR rather than dynamically created at run-time.  By instantiating the data in LASR itself you incur a performance penalty when the load job runs, but that may only happen once or twice a day.  If you dynamically calculate various fields (e.g. ending balance = starting balance - expenses), then that calculation happens every time the report loads.  This means that every time any user changes any filter and re-loads any page, the calculation has to run again.  If this is happening against a table with millions of records, then you are talking about millions of calculations per interaction, which is going to cost you in terms of performance.
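
A minimal sketch of pre-computing a derived field once in the load job rather than as a run-time calculated item in the report; the table and column names are hypothetical:

/* Hypothetical pre-computed column: evaluated once per ETL run instead */
/* of on every filter change and page load in the report.               */
data hdfslib.balances;
   set work.balances_staged;
   ending_balance = starting_balance - expenses;
run;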

References