SQL Server – How to make the SSIS process consume more resources and run faster

performance, sql-server, sql-server-2008-r2, ssis

I have a daily ETL process in SSIS that builds my warehouse so we can provide day-over-day reports.

I have two servers – one for SSIS and the other for the SQL Server database. The SSIS server (SSIS-Server01) is an 8-CPU, 32GB RAM box. The database server (DB-Server) is another 8-CPU, 32GB RAM box. Both are VMware virtual machines.

In its oversimplified form, the SSIS package reads 17 million rows (about 9GB) from a single table on DB-Server, unpivots them to 408 million rows, does a few lookups and a ton of calculations, and then aggregates the result back down to about 8 million rows, which are written to a brand-new table on the same DB-Server each run (this table is then moved into a partition to provide day-over-day reports).

I have a loop that processes 18 months' worth of data at a time – a grand total of 10 years of data. I chose 18 months based on my observation of RAM usage on SSIS-Server: at 18 months it consumes 27GB of RAM. Any higher than that, and SSIS starts buffering to disk and performance nosedives.

Here is my dataflow http://img207.imageshack.us/img207/4105/dataflow.jpg


I am using Microsoft's Balanced Data Distributor to send data down 8 parallel paths to maximize resource usage. I do a union before starting work on my aggregations.

Here is the task manager graph from the SSIS server


Here is another graph showing the 8 individual CPUs


As you can see from these images, memory usage slowly increases to about 27GB as more and more rows are read and processed. However, CPU usage stays roughly constant at around 40%.

The second graph shows that we are only using 4 (sometimes 5) CPUs out of 8.

I am trying to make the process run faster (it is only using 40% of the available CPU).

How do I go about making this process run more efficiently (least time, most resources)?

Best Answer

In addition to the good suggestions already made by bilinkc, and without knowing where the bottleneck is, I would try a few more things.

As you already noted, you should work on parallelism, not on processing more data (months) in the same dataflow. You have already made the transformations run in parallel, but the source and destination (as well as the aggregation) are not running in parallel! So read to the end, and keep in mind that you should make them run in parallel too in order to utilize your CPU power. Also remember that you are memory-bound (you can't aggregate an unlimited number of months in one batch), so the way to scale out is to take a chunk of data, process it, and write it to the destination database as soon as possible. This requires eliminating the shared components (one source, one Union All), because every chunk of data is limited by the throughput of those shared components.

Source related optimization:

  • Try multiple sources (and destinations) in the same dataflow instead of the Balanced Data Distributor. You have a clustered index on the date column, so your database server can quickly retrieve data in date-based ranges. And since you run the package on a different server than the one the database resides on, multiple parallel sources will also increase network utilization.
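Each parallel source component can then pull its own date range directly. A minimal sketch of such source queries, using a hypothetical dbo.SourceFact table with the clustered index on an [EventDate] column (all names illustrative; substitute your own):

```sql
-- Source query for parallel path 1 (first half of the window).
-- Half-open ranges keep the predicate sargable against the clustered index.
SELECT *
FROM dbo.SourceFact
WHERE [EventDate] >= '2012-01-01'
  AND [EventDate] <  '2012-07-01';

-- Source query for parallel path 2 (second half of the window), and so on
-- for each additional parallel path.
SELECT *
FROM dbo.SourceFact
WHERE [EventDate] >= '2012-07-01'
  AND [EventDate] <  '2013-01-01';
```

Because each query seeks a disjoint clustered-index range, the reads do not contend with each other the way a single shared source does.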

Transformations related optimization:

  • Do you really need to do a Union All before the Aggregate? If not, take a look at the destination-related optimizations regarding multiple destinations.
  • Set the Keys, KeyScale, and AutoExtendFactor properties on the Aggregate component to avoid re-hashing; if these properties are set incorrectly, you will see a warning during package execution. Note that predicting optimal values is easier for batches of a fixed number of months than for an ever-growing number (in your case 18 and rising).
  • Consider aggregating and (un)pivoting in SQL Server instead of in the SSIS package – SQL Server outperforms Integration Services at these tasks. Of course, the transformation logic may require some transformations to run in the package before you can aggregate.
  • If you can aggregate (and pivot/unpivot) monthly data in the database, try doing it in the source query or in the destination database with SQL. Depending on your environment, writing to a separate table in the destination database, building an index, and then aggregating with SELECT INTO may be faster than doing it in the package. Note that parallelizing such activities will put a lot of pressure on your storage.
  • You have a Multicast at the end. I don't know how many rows reach it, but consider the following: write to the destination on the right (in the screenshot) first, then populate the records for the destination on the left with a SQL query, to eliminate the second aggregation and release resources. SQL Server will probably do it much faster.
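As a sketch of the "aggregate with SQL instead of in the package" idea, assuming the dataflow lands its transformed rows in a hypothetical dbo.FactStaging table (all object and column names illustrative):

```sql
-- Let SQL Server do the heavy aggregation after the dataflow has staged
-- the transformed rows. SELECT INTO can be minimally logged under the
-- Simple recovery model, which keeps log pressure down.
SELECT CustomerKey,
       MonthKey,
       SUM(Amount) AS TotalAmount,
       COUNT(*)    AS RowCnt
INTO   dbo.FactAggregated
FROM   dbo.FactStaging   -- populated by the SSIS dataflow
GROUP BY CustomerKey, MonthKey;
```

This trades an in-memory SSIS Aggregate for a set-based GROUP BY on the server, which also frees SSIS buffer memory for the next chunk.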

Destination related optimization:

  • Use the SQL Server Destination if possible (the package must run on the same server as the database, and the destination database must be SQL Server). Note that it requires an exact column data-type match between the pipeline and the table columns.
  • Consider setting the Recovery Model of your destination (data warehouse) database to Simple.
  • Parallelize the destinations: instead of Union All + Aggregate + destination, use separate Aggregates and separate destinations (writing to the same table). Here you should consider partitioning your destination table and putting the partitions on separate filegroups. If you process data month by month, make the partitions by month and use partition switching.
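Partition switching is a metadata-only operation, so moving a fully loaded month into the partitioned fact table is nearly instant. A rough sketch, assuming a monthly partition scheme and illustrative object names (the partition number and tables are hypothetical):

```sql
-- Simple recovery model helps keep the bulk loads minimally logged.
ALTER DATABASE WarehouseDB SET RECOVERY SIMPLE;

-- Load one month into a staging table that matches the partitioned
-- table's schema, indexes, and filegroup, then switch it in.
-- The staging table must carry a CHECK constraint restricting it to
-- that partition's date range, or the switch will fail.
ALTER TABLE dbo.FactMonthStaging
    SWITCH TO dbo.FactWarehouse PARTITION 42;  -- 42 = target month's partition number
```

Each parallel monthly load can then write to its own staging table and switch in independently, instead of all writers contending on one destination table.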

It seems I was unclear about which way to go with parallelism. You can try:

  • putting multiple sources in a single dataflow; this requires you to copy and paste the transformation logic and destination for each source
  • running multiple dataflows in parallel, where each dataflow processes only one month
  • running multiple packages in parallel, where each package has one dataflow that processes only one month, plus one master package to control the execution of each (monthly) package; this is the preferred way, because you will probably run the package for only one month at a time once you get into production
  • the same as the previous option, but with a Balanced Data Distributor, Union All, and Aggregate

Before you do anything else, you might want to run a quick test: take your original package, change it to process one month, make an exact copy that processes another month, and run those packages in parallel. Compare that to your original package processing two months. Do the same with two separate 6-month packages versus a single 12-month package. This should drive your server to full CPU usage.

Try not to over-parallelize: you will have multiple writers hitting the destination, so don't start 18 parallel monthly packages – start with 3 or 4.

And finally, I strongly believe that memory pressure and destination I/O pressure are the bottlenecks to eliminate.

Please inform us on your progress.