SQL – Algorithm for Finding Most Overlapping Events

algorithms, sql

The Problem

I'm looking for a query to help me solve the following:

I have a series of events
Each event has a start and end date.
Many of these events overlap
The answer I'm looking for is the maximum number of events that overlap

Example

Say I have 5 events:

1 Jan -> 9 Jan
7 Jan -> 12 Jan
8 Jan -> 10 Jan
10 Jan -> 15 Jan
12 Jan -> 17 Jan

3 of these events overlap on 9th January, which is the maximum number of overlapping events, so the answer is 3.

(There are also 3 events overlapping on 10th January, but that's the same answer)

What I've Tried

If I was doing this in memory, I could do this:

For each event:
- Get the start date
- Count the events whose date range includes that start date
Pick the highest count.
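
The steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `max_overlap_naive` is made up for this example, and the year in the dates is arbitrary because the question only gives days and months.

```python
from datetime import date

# The five example events from the question as (start, end) pairs.
# The year 2024 is an arbitrary assumption; the question gives none.
EVENTS = [
    (date(2024, 1, 1), date(2024, 1, 9)),
    (date(2024, 1, 7), date(2024, 1, 12)),
    (date(2024, 1, 8), date(2024, 1, 10)),
    (date(2024, 1, 10), date(2024, 1, 15)),
    (date(2024, 1, 12), date(2024, 1, 17)),
]


def max_overlap_naive(events):
    """Return the largest number of events in progress on any event's start date."""
    best = 0
    for start, _ in events:
        # Inclusive comparison: an event that ends on `start` still counts.
        count = sum(1 for s, e in events if s <= start <= e)
        best = max(best, count)
    return best


print(max_overlap_naive(EVENTS))  # prints 3
```

Checking only start dates is enough because the number of running events can only go up on a day when some event starts. The nested loop makes this quadratic in the number of events, which is the efficiency concern raised below.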

But there are 2 issues with this:

It doesn't appear to be very efficient
It isn't very SQL-y (i.e. it is procedural rather than set-based)

Question

How can I implement something like this in SQL?

Notes

I don't care to find the start / end dates of the most overlapping events. I just need a count
I don't care how often the maximum occurs, I just need the maximum. So, in the example above, I know there are two occasions where 3 events overlap, but I just need the "3".
If event A ends on the same day as event B starts, they are considered to be overlapping

Best Answer

It might be more performant to write this in code, not sql.

In code, I would sort the items by start date, then end date. Walk through them and, for each item, check whether it overlaps the next item. If it does, increment the item's overlap counter and repeat the check with the item after that. If it doesn't, move on to the next item.
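
One way to do this in code is an endpoint sweep, sketched below. Note that this is a slightly different technique from the per-item counter described above: it sorts all start and end dates together and keeps a running total of events in progress. The function name `max_overlap_sweep` and the year in the sample dates are assumptions made for illustration.

```python
from datetime import date


def max_overlap_sweep(events):
    """Return the peak number of simultaneous events using an endpoint sweep."""
    points = []
    for start, end in events:
        points.append((start, 0, +1))  # kind 0: starts sort first on a shared day
        points.append((end, 1, -1))    # kind 1: ends sort after starts
    points.sort()

    running = best = 0
    for _, _, delta in points:
        running += delta
        best = max(best, running)
    return best


# Example events from the question; the year is an arbitrary assumption.
sample_events = [
    (date(2024, 1, 1), date(2024, 1, 9)),
    (date(2024, 1, 7), date(2024, 1, 12)),
    (date(2024, 1, 8), date(2024, 1, 10)),
    (date(2024, 1, 10), date(2024, 1, 15)),
    (date(2024, 1, 12), date(2024, 1, 17)),
]
print(max_overlap_sweep(sample_events))  # prints 3
```

Processing starts before ends on the same day is what makes an event ending on the day another begins count as overlapping, as the question requires. The sort dominates the cost, so this runs in O(n log n).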

In SQL, you can do it with a self join to list all the overlaps.

This will show you all overlaps:

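-- Pair every event with each event it overlaps (including itself).
-- >= and <= make events that touch on the same day count as overlapping.
select a.eventid
from events a
inner join events b
    on a.[end] >= b.start and a.start <= b.[end]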

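You can then group them by eventid, select the count, and take the event with the max count. Keep in mind that every event also matches itself in the join, and that this count is the number of events a given event overlaps over its whole duration, which can be higher than the number of events running on any single day.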

select top 1 eventid, count(*) as c
from
    (select a.eventid from events a
    inner join events b
    on a.[end] >= b.start and a.start <= b.[end]) as overlaps  -- derived table needs an alias
group by eventid
order by c desc
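
Counting overlap partners per event is not quite the same as counting events in progress at once: in the question's example, the 7 Jan -> 12 Jan event touches all four other events, yet no more than 3 events run on any one day. A variation on the self join that counts, for each event's start date, the events covering that date can be tried out with the sketch below. It uses Python's built-in sqlite3 module purely as a convenient test harness; the table layout, the column names `start_date` and `end_date`, and the year in the dates are assumptions, and the SQL would need small dialect changes for other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eventid INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT)"
)
# Example events from the question; ISO date strings compare correctly as text.
conn.executemany(
    "INSERT INTO events (eventid, start_date, end_date) VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", "2024-01-09"),
        (2, "2024-01-07", "2024-01-12"),
        (3, "2024-01-08", "2024-01-10"),
        (4, "2024-01-10", "2024-01-15"),
        (5, "2024-01-12", "2024-01-17"),
    ],
)

# For each event a, count the events b that are in progress on a's start date,
# then take the largest of those counts.
QUERY = """
SELECT MAX(in_progress)
FROM (
    SELECT a.eventid, COUNT(*) AS in_progress
    FROM events a
    JOIN events b
      ON b.start_date <= a.start_date
     AND b.end_date   >= a.start_date
    GROUP BY a.eventid
) AS per_start
"""

max_overlap = conn.execute(QUERY).fetchone()[0]
print(max_overlap)  # prints 3
```

Only start dates need checking because the number of running events can only rise on a day when an event starts. The inclusive comparisons implement the rule that an event ending on the day another starts counts as overlapping.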

Related Solutions

Database Design – Should Date and Time Fields Be Merged

One minor point: when you merge the two columns, you might want to do the merge into a new "MONKEY_DATE_2" column instead of overwriting the existing one. That leaves your current columns unchanged, and you can use grep to find all the code that hasn't been updated to work with the new structure.

Database – Logic design question for SQL query

Depending on the platform, this may or may not be possible in a single SQL query, without using a stored procedure.

This problem is effectively equivalent to tree traversal within a table with parent references. You can construct the tree structure through a self join using a query like this:

SELECT 
    b.*,
    a.event as parent_event,
    a.start as parent_start,
    a.end as parent_end
FROM
    interval_test a
        join
    interval_test b ON b.start >= a.end
order by parent_start, start

Producing a result that looks like this:

[Image: Parent child hierarchy for showing potential predecessor relationships]

Note: the result is truncated to save space, but you get the idea...

Each of the possible successors of an event (those with a start >= the end of the predecessor) can be thought of as a child in the tree. You want to find the path down the tree that minimizes the start time at each step.

You can do this kind of tree traversal with a recursive query (a recursive common table expression) in SQL Server or a hierarchical query in Oracle. MySQL only gained recursive common table expressions in version 8.0; on earlier versions you will need to use a stored procedure.

For example, in SQL Server, the following query would work:

        WITH min_path AS
          (SELECT RANK() OVER (ORDER BY a.start ASC) AS [rank], cast(NULL as CHAR(10)) as parent_event, 
               NULL as parent_start, NULL as parent_end, a.event, a.start, a.[end]
           FROM dbo.interval_test a
           WHERE a.start=
               (SELECT MIN(START)
                FROM dbo.interval_test)
           UNION ALL SELECT RANK() OVER (ORDER BY c.start ASC) AS [rank], b.event as parent_event, 
               b.start as parent_start, b.[end] as parent_end, c.event, c.start, c.[end]
           FROM min_path b
           JOIN dbo.interval_test c ON c.start>= b.[end]
           WHERE [rank]=1)
        SELECT event, start, [end] from min_path where [rank]=1

Producing the following result:

[Image: Results for tree traversal query]

Note, there are probably cleaner ways to write the query above for SQL Server.

From a performance perspective, when you are working in an environment with potentially billions of rows as mentioned in the comments above, I don't have a clear idea of how this might perform. If the query is limited to a small number of events related to a single user it might be fine. Trying to find the whole set of relevant non-overlapping events would probably not work well and you might be better off using a procedural approach.
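
As a rough idea of what a procedural approach could look like, here is a minimal Python sketch of the traversal as described above: start from the earliest event, then repeatedly take the earliest-starting event that begins at or after the end of the previous pick. The function name `earliest_start_chain`, the tuple layout, and the sample rows with integer timestamps are all hypothetical; the original interval_test data is not shown here.

```python
def earliest_start_chain(events):
    """Greedily chain events: each pick is the earliest-starting event
    that begins at or after the end of the previous pick."""
    chain = []
    current_end = None
    # Sorting by (start, end) means the first qualifying event seen
    # is always the one with the smallest start.
    for name, start, end in sorted(events, key=lambda ev: (ev[1], ev[2])):
        if current_end is None or start >= current_end:
            chain.append(name)
            current_end = end
    return chain


# Hypothetical rows as (event, start, end) with made-up integer timestamps.
sample = [("A", 1, 4), ("B", 2, 3), ("C", 4, 6), ("D", 5, 9), ("E", 6, 8)]
print(earliest_start_chain(sample))  # prints ['A', 'C', 'E']
```

After the sort, this is a single pass over the rows, so it avoids materialising the full self join, which is the part likely to hurt with very large tables.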