Database – Logic design question for SQL query

databasedesignlogicsql

Was wondering if you could help me on a SQL problem I’m having.

I have a set of records of events where each event has a start time and end time. No event has the same start time, and the end time for each event is greater than or equal to the preceding event.

What I want to be able to do is pull out the set of records that are not overlapping – where the start time for an event is not between the start/end for any other event. In the example below, the start time for B/C/D are in-between A’s start and end, so I want to keep A, and get rid of B,C,D. Then start over again, keeping E, and get rid of F, then start over and keep G, getting rid of H/I, etc.

Seems like this requires a self join but I can’t figure out the criteria for that would identify the records this way. Any ideas?

Event |  Start  | End
A             1            10 (keep)
B             2            11 (discard because start time of 2 < 10)
C             5            15 (discard because start time of 5 < 10)
D             9            12 (discard because start time of 9 < 10)
E             10           13 (keep -  this starts a new set because it’s the first record after A with start >= A end)
F             11           20 (discard because 11 < 13)
G             14           22 (keep -  this starts a new set because it’s the first record after E with start >= E)
H             15           22 (discard because 15 < 22)
I             17           27 (discard because 17 < 22)
J             22           27 (keep -  this starts a new set because it’s the first record after G with start >= G)

Best Answer

Depending on platform, this may or may not be possible in a single sql query/without use of a stored procedure.

This problem is effectively equivalent to tree traversal within a table with parent references. You can construct the tree structure through a self join using a query like this:

SELECT 
    b. *,
    a.event as parent_event,
    a.start as parent_start,
    a.end as parent_end
FROM
    interval_test a
        join
    interval_test b ON b.start >= a.end
order by parent_start, start

Producing a result that looks like this:

Parent child hierarchy for showing potential predecessor relationships

Note: the result is truncated to save space, but you get the idea...

Each of the possible successors of an event (those with an end >= start of the predecessor) can be thought of as a child in the tree. You want to find the tree path down the tree that minimizes the start time at each step.

You can do this kind of tree traversal by using recursive queries within SQL Server or using a hierarchical query in Oracle. This kind of recursion is not supported in MySQL and you will need to use a stored procedure if that is your platform.

For example, in SQL Server, the following query would work:

        WITH min_path AS
          (SELECT RANK() OVER (ORDER BY a.start ASC) AS [rank], cast(NULL as CHAR(10)) as parent_event, 
               NULL as parent_start, NULL as parent_end, a.event, a.start, a.[end]
           FROM dbo.interval_test a
           WHERE a.start=
               (SELECT MIN(START)
                FROM dbo.interval_test)
           UNION ALL SELECT RANK() OVER (ORDER BY c.start ASC) AS [rank], b.event as parent_event, 
               b.start as parent_start, b.[end] as parent_end, c.event, c.start, c.[end]
           FROM min_path b
           JOIN dbo.interval_test c ON c.start>= b.[end]
           WHERE [rank]=1)
        SELECT event, start, [end] from min_path where [rank]=1

Producing the following result:

Results for tree traversal query

Note, there are probably cleaner ways to write the query above for SQL Server.

From a performance perspective, when you are working in an environment with potentially billions of rows as mentioned in the comments above, I don't have a clear idea of how this might perform. If the query is limited to a small number of events related to a single user it might be fine. Trying to find the whole set of relevant non-overlapping events would probably not work well and you might be better off using a procedural approach.

Related Topic