Following the good suggestions bilinkc made, and without knowing where the bottleneck is, I would try a few more things.
As you already noted, you should work on parallelism, not on processing more data (more months) in the same dataflow. You have already made the transformations run in parallel, but the source and destination (as well as the aggregation) are not running in parallel. So read to the end and keep in mind that you should parallelize them too in order to use your CPU power. Also remember that you are memory bound (you cannot aggregate an unlimited number of months in one batch), so the way to scale out is to take a chunk of data, process it, and write it to the destination database as soon as possible. That requires eliminating the shared components (the single source, the single Union All), because every chunk of data is limited by the speed of those shared components.
Source-related optimization:
- try multiple sources (and destinations) in the same dataflow instead of the Balanced Data Distributor; you have a clustered index on the date column, so your database server can retrieve date-based ranges quickly. Note that if you run the package on a different server than the one the database resides on, you will increase network utilization
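To illustrate the multiple-sources idea, each source component can be given its own disjoint date range so the clustered index on the date column turns every read into a cheap range scan. This is only a sketch; the table and column names (dbo.FactSales, SaleDate) are placeholders, not from your package:

```sql
-- Hypothetical source query for source #1 (January):
SELECT *
FROM dbo.FactSales
WHERE SaleDate >= '20130101' AND SaleDate < '20130201';

-- Hypothetical source query for source #2 (February):
SELECT *
FROM dbo.FactSales
WHERE SaleDate >= '20130201' AND SaleDate < '20130301';
```

Half-open ranges (>= start, < next start) make sure no row is read twice or skipped at month boundaries.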
Transformation-related optimization:
- Do you really need the Union All before the Aggregate? If not, take a look at the destination-related optimization below regarding multiple destinations
- Set the Keys, KeysScale, and AutoExtendFactor properties on the Aggregate component to avoid re-hashing; if these properties are set incorrectly, you will see a warning during package execution. Note that predicting optimal values is easier for batches of a fixed number of months than for an ever-growing number (18 and rising, in your case)
- Consider aggregating and (un)pivoting in SQL Server instead of doing it in the SSIS package; SQL Server outperforms Integration Services at these tasks. Of course, the transformation logic may be such that it forbids aggregating before some transformations run in the package
- if you can aggregate (and pivot/unpivot) monthly data in the database, try doing it in the source query or in the destination database with SQL; depending on your environment, writing to a separate table in the destination database, building an index, and aggregating with SELECT INTO might be faster than doing it in the package. Note that parallelizing such activities will put a lot of pressure on your storage
- You have a Multicast at the end; I don't know how many rows reach it, but consider the following: write to the destination on the right (in the screenshot), then populate the records for the destination on the left with a SQL query, to eliminate the second aggregation and free up resources (SQL Server will probably do it much faster)
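As a sketch of pushing the aggregation into the source query per month, something like the following could replace the Aggregate component; all names (dbo.FactSales, CustomerId, Amount, SaleDate) are assumptions for illustration:

```sql
-- Hypothetical per-month source query that aggregates in SQL Server
-- instead of in the SSIS Aggregate component:
SELECT CustomerId,
       DATEADD(MONTH, DATEDIFF(MONTH, 0, SaleDate), 0) AS SaleMonth,
       SUM(Amount) AS TotalAmount
FROM dbo.FactSales
WHERE SaleDate >= '20130101' AND SaleDate < '20130201'
GROUP BY CustomerId, DATEADD(MONTH, DATEDIFF(MONTH, 0, SaleDate), 0);
```

Because each query covers exactly one month, the GROUP BY stays within a bounded working set, which also sidesteps the memory-bound aggregation problem described above.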
Destination-related optimization:
- use the SQL Server Destination if possible (the package must run on the same server as the database, and the destination database must be SQL Server); note that it requires the pipeline column data types to match the table column data types exactly
- consider setting the Recovery Model of your destination (data warehouse) database to Simple
- parallelize the destinations: instead of Union All + Aggregate + destination, use separate Aggregates and separate destinations (into the same table); here you should consider partitioning your destination table and putting the partitions on separate filegroups. If you process data month by month, partition by month and use partition switching
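A minimal sketch of the recovery-model and partition-switching ideas above, assuming monthly partitions; every object name here (MyDW, pfMonth, psMonth, dbo.FactSales, dbo.StageSales_201302) is a placeholder:

```sql
-- Simple recovery keeps bulk loads minimally logged:
ALTER DATABASE MyDW SET RECOVERY SIMPLE;

-- One partition per month on the date column:
CREATE PARTITION FUNCTION pfMonth (date)
    AS RANGE RIGHT FOR VALUES ('20130101', '20130201', '20130301');

CREATE PARTITION SCHEME psMonth
    AS PARTITION pfMonth ALL TO ([PRIMARY]);  -- or one filegroup per partition

-- Load each month into its own staging table in parallel, then switch the
-- staging table into the matching partition (a metadata-only operation):
ALTER TABLE dbo.StageSales_201302
    SWITCH TO dbo.FactSales PARTITION $PARTITION.pfMonth('20130201');
```

The switch itself is nearly instantaneous because it only changes metadata, so the parallel monthly loads never block each other on the final table.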
It seems I was unclear about which way to go with parallelism. You can try:
- putting multiple sources in a single dataflow, which requires you to copy and paste the transformation logic and the destination for each source
- running multiple dataflows in parallel, where each dataflow processes only one month
- running multiple packages in parallel, where each package has one dataflow that processes only one month, plus one master package to control the execution of each monthly package; this is the preferred way, because you will probably run the package for only one month at a time once you get into production
- or the same as the previous option, but with a Balanced Data Distributor plus Union All and Aggregate
Before you do anything else, you might want to run a quick test: take your original package, change it to process one month, make an exact copy that processes another month, and run the two packages in parallel. Compare that to your original package processing two months. Do the same for two separate 6-month packages versus a single 12-month package. It should drive your server to full CPU usage.
Try not to over-parallelize, because you will have multiple concurrent writes to the destination; don't start 18 parallel monthly packages, but rather 3 or 4 to start with.
Finally, I strongly believe that memory pressure and destination I/O pressure are the bottlenecks to eliminate.
Please inform us on your progress.
The solution was two-fold:
- I exported the certificate from the primary site server under "Personal\Certificates"; the friendly name is "ConfigMgr SQL Server Identification Certificate." I then imported the certificate on my local workstation under "Trusted Root Certificate Authorities\Certificates." This allowed me to connect to SQL Server when choosing "Use the current Windows user." Before this change, I would get a certificate error when choosing that option.
- I had the DBA grant db_datareader access explicitly to my user in SQL Server.
When creating a dataset I was still prompted for credentials, but I can now select the current user and create queries.
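For reference, granting db_datareader looks roughly like the following; the database name and login are placeholders, not the actual values from this environment:

```sql
-- Hypothetical grant the DBA would run in the site database:
USE CM_P01;  -- placeholder ConfigMgr site database name
CREATE USER [DOMAIN\myuser] FOR LOGIN [DOMAIN\myuser];
ALTER ROLE db_datareader ADD MEMBER [DOMAIN\myuser];
```

db_datareader gives SELECT on all tables and views in that database, which is what report authoring needs without granting any write access.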
SSRS snapshots will not give you the functionality you are looking for.
You can create an SSRS subscription that outputs an Excel file (or another file type) to a file share. Alternatively, you can take the query from your reports and use it to copy point-in-time data to another database/table (with a nightly job).
The easiest solution would probably be the subscription, though.
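The nightly-job alternative could be sketched as a single INSERT that stamps each copy with the capture date; every table and column name here (dbo.ReportHistory, dbo.FactSales, CustomerId, Amount) is an assumption for illustration:

```sql
-- Hypothetical SQL Agent job step: copy point-in-time report data
-- into a history table, tagged with the date it was captured.
INSERT INTO dbo.ReportHistory (SnapshotDate, CustomerId, TotalAmount)
SELECT CAST(GETDATE() AS date), CustomerId, SUM(Amount)
FROM dbo.FactSales
GROUP BY CustomerId;
```

The SnapshotDate column then lets the report filter to any past night's state of the data.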
HTH, Dan