Sql – Type II dimension joins

Tags: data-warehouse, dimensions, sql, sql-server, type-2-dimension

I have the following lookup table in OLTP:

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    TransactionStateName VarChar (100)
)

When this comes into my OLAP, I change the structure as follows:

CREATE TABLE TransactionState
(
    TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime DateTime NULL
)

My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.

I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime DateTime NULL
)

Should I go with bachelorette #2 or bachelorette #3?

Best Answer

By this phrase:

With the combination of StartDateTime and EndDateTime, they would be unique.

do you mean that the ranges never overlap, or that they satisfy a database UNIQUE constraint?

If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".

If the latter, then just use a fake (surrogate) identity column.
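A minimal sketch of the surrogate-key route, assuming the third table layout from the question (the one with OriginalTransactionStateId). The fact table FactTransaction, its columns, the variable names, and the half-open validity range (EndDateTime exclusive) are all assumptions made for illustration, not something the question specifies.

```sql
-- Hypothetical fact table: it stores the dimension's surrogate key,
-- so each fact row points at exactly one version of the state.
CREATE TABLE FactTransaction
(
    FactTransactionId   INT IDENTITY (1, 1) NOT NULL,
    TransactionStateId  INT NOT NULL,       -- surrogate key of the dimension row
    TransactionDateTime DATETIME NOT NULL,
    Amount              DECIMAL (18, 2) NOT NULL
);

-- Load time: resolve the OLTP key plus the event time to a surrogate key once.
-- Assumes ranges are half-open and a NULL EndDateTime marks the current version.
DECLARE @SourceStateId INT = 1;
DECLARE @EventDateTime DATETIME = '2009-02-15';

SELECT  ts.TransactionStateId
FROM    TransactionState ts
WHERE   ts.OriginalTransactionStateId = @SourceStateId
        AND ts.StartDateTime <= @EventDateTime
        AND (ts.EndDateTime IS NULL OR ts.EndDateTime > @EventDateTime);

-- Query time: a plain equality join, no date-range condition needed.
SELECT  f.Amount, ts.TransactionStateName
FROM    FactTransaction f
JOIN    TransactionState ts
ON      ts.TransactionStateId = f.TransactionStateId;
```

The range lookup is paid once per loaded fact row rather than on every report query, and the equality join leaves the optimizer free to choose any join method.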

Databases in general do not allow an efficient algorithm for this query:

SELECT  *
FROM    TransactionState
WHERE   @value BETWEEN StartDateTime AND EndDateTime

, unless you do arcane tricks with SPATIAL data.

That's why you'll have to use this condition in a JOIN:

SELECT  *
FROM    factTable
CROSS APPLY
        (
        SELECT  TOP 1 *
        FROM    TransactionState
        WHERE   StartDateTime <= factDateTime
        ORDER BY
                StartDateTime DESC
        ) ts

, which will deprive the optimizer of the possibility of using a HASH JOIN, which is in many cases the most efficient method for such queries.

See this article for more details on this approach:

Converting currencies

Rewriting the query so that it can use a HASH JOIN resulted in a 600% performance gain, though this is only possible if your datetimes have a granularity of a day or coarser (otherwise the hash table will grow very large).

Since the time component is stripped from your StartDateTime and EndDateTime, you can create a CTE like this:

WITH    cal AS
        (
        SELECT CAST('2009-01-01' AS DATE) AS cdate
        UNION ALL
        SELECT DATEADD(day, 1, cdate)
        FROM   cal
        WHERE  cdate <= '2009-03-01'
        ),
        state AS
        (
        SELECT  cdate, ts.*
        FROM    cal
        CROSS APPLY
                (
                SELECT  TOP 1 *
                FROM    TransactionState
                WHERE   StartDateTime <= cdate
                ORDER BY
                        StartDateTime DESC
                ) ts
        WHERE   ts.EndDateTime IS NULL OR ts.EndDateTime >= cdate  -- NULL marks the current row
        )
SELECT  *
FROM    factTable
JOIN    state
ON      cdate = CAST(factDate AS DATE)

If your date ranges span more than 100 dates, raise the MAXRECURSION option on the query that uses the CTE (the default limit is 100 recursion levels).
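As a sketch of that adjustment, the hint goes in an OPTION clause at the end of the statement that consumes the recursive CTE. The date range below is only an illustrative assumption.

```sql
-- Calendar covering all of 2009: 365 rows, i.e. 364 recursion levels,
-- which exceeds the default limit of 100.
WITH    cal AS
        (
        SELECT CAST('2009-01-01' AS DATE) AS cdate
        UNION ALL
        SELECT DATEADD(day, 1, cdate)
        FROM   cal
        WHERE  cdate < '2009-12-31'
        )
SELECT  cdate
FROM    cal
OPTION  (MAXRECURSION 366);  -- 0 would remove the limit entirely
```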

Generic example of an SQL update query using joins:

UPDATE A
SET foo = B.bar
FROM TableA A
JOIN TableB B
    ON A.col1 = B.colx
WHERE ...

Sql – What are the options for storing hierarchical data in a relational database

My favorite answer is what the first sentence in this thread suggested: use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.
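To illustrate the combination, here is a minimal sketch. The table name Hierarchy and the columns NodeId, ParentId, Lft and Rgt are hypothetical, and the step that recomputes Lft and Rgt from ParentId is deliberately left out.

```sql
-- Hypothetical table carrying both representations side by side.
CREATE TABLE Hierarchy
(
    NodeId   INT NOT NULL PRIMARY KEY,
    ParentId INT NULL REFERENCES Hierarchy (NodeId),  -- Adjacency List: edited directly
    Lft      INT NULL,                                -- Nested Sets bounds: rebuilt by
    Rgt      INT NULL                                 -- a separate conversion step
);

-- Maintenance through the Adjacency List: moving a whole subtree
-- is a single-row update of the parent pointer.
UPDATE  Hierarchy
SET     ParentId = 2
WHERE   NodeId = 5;

-- Querying through Nested Sets: all descendants of node 2,
-- at any depth, with one range comparison and no recursion.
SELECT  child.NodeId
FROM    Hierarchy parent
JOIN    Hierarchy child
ON      child.Lft > parent.Lft
        AND child.Lft < parent.Rgt
WHERE   parent.NodeId = 2;
```

After an update such as the one above, Lft and Rgt are stale until the conversion is rerun, which is why the speed of that conversion matters so much.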

The problem until now has been that converting an Adjacency List to Nested Sets is frightfully slow, because most people use the extreme RBAR (Row By Agonizing Row) method known as a "Push Stack" to do the conversion. That has been considered way too expensive a price for reaching the Nirvana of simple maintenance through the Adjacency List combined with the awesome performance of Nested Sets. As a result, most people end up settling for one or the other, especially if there are more than, say, a lousy 100,000 nodes or so. The push stack method can take a whole day to convert what MLM'ers would consider a small million-node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parentheses).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.
