SQL Algorithm for Finding Resource Availability

algorithms, scheduling, sql

I'm having trouble creating a MySQL-compatible algorithm for this.

Background

App with MySQL, Perl and JS. It's a booking system where each booking consists of a start, end and qty. Start and end are timestamps.

Schema simplified to a single table:

|  bookings        
|-------------------
| id    | pkey      
| start | timestamp 
| end   | timestamp 
| qty   | int

Question

In SQL, how do you check how many resources are booked at once for a given timeRange? Code with explanation or an algorithm that is SQL compatible both work.

So, for the following schedule:

09:00 -----               <-|
09:30 |   |                 | A maximum of 12 are booked at once during this range
10:00 |x7 |                 | 
10:30 ----- ----- -----     |
11:00       |   | |   |     |                       
11:30       |x2 | |x10|   <-|
12:00       |   | |   |
12:30       ----- -----

I should get 12 since the x2 and x10 bookings don't overlap with the x7 booking, so there are never more than 12 items booked at once between 9:00 and 11:30.

Progress

-- It's been heavily shrunk to show just the relevant part, so it might have some errors
SELECT coalesce(max(qtyOverlap.qtySum), 0) AS booked
FROM (
    SELECT coalesce(sum(b2.qty), 0) AS qtySum
        FROM booking b1
        LEFT JOIN (
            SELECT b.qty, b.tStart, b.tEnd FROM booking b
        ) b2
        ON b1.tStart < b2.tEnd AND
           b1.tEnd > b2.tStart AND
           b2.tStart < '2015-02-19 16:30:00' AND
           b2.tEnd > '2015-02-19 06:00:00'
        WHERE
              b1.tStart < '2015-02-19 16:30:00' AND
              b1.tEnd > '2015-02-19 06:00:00'
        GROUP BY b1.id
) qtyOverlap
-- no outer GROUP BY: the subquery above exposes no itemId column

Which is this algorithm:

Max of
    For each booking that overlaps given timeRange
        return sum of
            each booking that overlaps this booking and given timeRange

In the schedule above this would be:

max([7],[2+10],[10+2]) = 12

But given a schedule like:

09:00 -----               <-|
09:30 |   |                 | A maximum of 17 are booked at once during this range, not 19
10:00 |x7 |                 | 
10:30 |   |       -----     |
11:00 -----       |   |     |                       
11:30       ----- |x10|   <-|
12:00       |x2 | |   |
12:30       ----- -----

This gives:

max([7+10],[2+10],[10+7+2]) = 19

Which is wrong.
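To make the failure easy to reproduce outside the database, here is a minimal JS sketch of the same pairwise algorithm. The names `overlaps`, `pairwiseMax`, `scheduleA` and `scheduleB` are made up for illustration, times are minutes since midnight, and the range is assumed to run to 12:00 so that the x2 booking is counted, as it is in the sums above.

```javascript
// Times are minutes since midnight; intervals are half-open [start, end).
function overlaps(a, b) {
    return a.start < b.end && a.end > b.start;
}

// For each booking in the range, sum every booking that overlaps it
// (itself included), then take the largest of those sums.
function pairwiseMax(bookings, range) {
    var inRange = bookings.filter(function (b) { return overlaps(b, range); });
    var sums = inRange.map(function (b1) {
        return inRange
            .filter(function (b2) { return overlaps(b1, b2); })
            .reduce(function (total, b2) { return total + b2.qty; }, 0);
    });
    return Math.max.apply(null, [0].concat(sums));
}

// First schedule: x7 09:00-10:30, x2 and x10 10:30-12:30.
var scheduleA = [
    { start: 540, end: 630, qty: 7 },
    { start: 630, end: 750, qty: 2 },
    { start: 630, end: 750, qty: 10 }
];
// Second schedule: x7 09:00-11:00, x2 11:30-12:30, x10 10:30-12:30.
var scheduleB = [
    { start: 540, end: 660, qty: 7 },
    { start: 690, end: 750, qty: 2 },
    { start: 630, end: 750, qty: 10 }
];
// Assumed range 09:00-12:00 (see the note above).
var range = { start: 540, end: 720 };

console.log(pairwiseMax(scheduleA, range)); // 12
console.log(pairwiseMax(scheduleB, range)); // 19, although only 17 are ever booked at once
```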

The only way I can think of to fix this is to use recursion (which isn't MySQL-compatible, afaik).

It would look something like this (in working JS code):

// Return the bookings that overlap the given range (or booking).
function getOverlaps(bookings,range) {
    return bookings.filter(function(booking){
        return isOverLapping(booking,range);
    });
}
// Two intervals overlap when each starts before the other ends.
function isOverLapping(a, b) {
    return (a.start < b.end && a.end > b.start);
}
function maxSum(booking, overlaps) { // main recursive function
    var currentMax = 0;
    // Only bookings that also overlap this one can be booked at the same time.
    var filteredOverlaps = getOverlaps(overlaps,booking);
    for (var i = 0; i < filteredOverlaps.length; i++) {
        currentMax = Math.max(
            maxSum(filteredOverlaps[i], removeElement(filteredOverlaps,i)),
            currentMax
        );
    }
    return currentMax + booking.qty;
}
// Copy the array without the element at index i.
function removeElement(array,i){
    var clone = array.slice(0);
    clone.splice(i,1);
    return clone;
}
// timeRange is treated as a booking here, so it needs a qty (presumably 0).
var maxBooked = maxSum(timeRange, getOverlaps(bookings,timeRange));

Visual JSFiddle demo

Any way to do this in SQL? (any reasonable way, that is)

Update
I just tried to use a stored procedure recursion emulation method as documented here. But part way through implementing it, I tried it with the demo data and decided the ~~performance was far too terrible.~~ Actually, it just needed indexing. Now it's just 'kinda' bad.

Best Answer

This is tricky, because you've modeled your bookings as time intervals with granularity as fine as your DB allows. Perfectly natural to do, but as you've found out it makes some comparisons difficult.

Max of
    For each booking that overlaps given timeRange
        return sum of
            each booking that overlaps this booking and given timeRange

The problem with this algorithm is that it checks whether every other booking overlaps the one currently being examined (the foreach iteration), but it never checks those overlapping bookings against each other to see whether they also overlap one another. A run-through of your second example goes like this:

Select 7x booking
- 7x does not overlap with 2x; +0
- 7x overlaps with 10x; +10
- Total 17
Select 2x booking
- 2x does not overlap with 7x; +0
- 2x overlaps with 10x; +10
- Total 12
Select 10x booking
- 10x overlaps with 7x; +7
- 10x overlaps with 2x; +2
- [Missing step: check if 7x and 2x overlap]
- Total 19
Max 19

Is it reasonably possible to map your data to nice, neat discrete chunks of some size? For example, do your bookings generally begin and end on the 15s (12:00, 12:15, 12:30, 12:45)? If so, you can change your algorithm to compare bookings against a static time interval, rather than each other, and drastically reduce the required number of comparisons:

Max of
  For each 15 minute chunk in timeRange
    Sum quantities of all bookings overlapping this chunk
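As a rough illustration of the chunked approach, here is a small JS sketch. The function name `chunkedMax`, the minutes-since-midnight representation and the sample data are assumptions made for the example, not part of your schema.

```javascript
// Times are minutes since midnight; intervals are half-open [start, end).
// chunkMinutes is the chosen granularity (15 in the example above).
function chunkedMax(bookings, range, chunkMinutes) {
    var max = 0;
    for (var t = range.start; t < range.end; t += chunkMinutes) {
        var chunkEnd = Math.min(t + chunkMinutes, range.end);
        var booked = 0;
        for (var i = 0; i < bookings.length; i++) {
            var b = bookings[i];
            // Count a booking if it overlaps any part of this chunk.
            if (b.start < chunkEnd && b.end > t) {
                booked += b.qty;
            }
        }
        max = Math.max(max, booked);
    }
    return max;
}

// Second schedule from the question: x7 09:00-11:00, x2 11:30-12:30, x10 10:30-12:30.
var bookings = [
    { start: 540, end: 660, qty: 7 },
    { start: 690, end: 750, qty: 2 },
    { start: 630, end: 750, qty: 10 }
];
console.log(chunkedMax(bookings, { start: 540, end: 690 }, 15)); // 17
```

Note that two bookings which fall inside the same chunk without touching each other are still summed together, so the result can overshoot when bookings do not line up with the chunk boundaries.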

In terms of SQL implementation, choose an interval size and use a numbers or tally table to generate an inline query to create the chunks:

select @startTime + interval (15 * numbers.value) minute as start
, @startTime + interval (15 * (numbers.value + 1)) minute as end
from numbers
where (@startTime + interval (15 * numbers.value) minute) < @endTime

(Off the cuff, may contain minor syntax or math errors)

This is a relatively sane way of performing this query in SQL without recursion. It has the obvious drawback that it will never perfectly align with your current schema, but do you really need absolute perfection?

I've used 15 minutes as an example size. You can easily make this as finely grained as you like: 5 minutes, 1 minute, 1 second, etc. There must be some point at which the granularity is too fine, because MySQL's timestamp type does not possess arbitrary precision (it stores whole seconds by default, and microseconds at most). "Booking" to me implies something involving humans actually showing up. If this is true, I can't imagine that an interval size smaller than one minute would be appropriate.

In the comments you expressed some concern about performance because of a large number of comparisons. Complexity for this algorithm is O(n*m), where n is the number of chunks (time range / interval size) and m is the number of booking rows within the given time range. I'll hazard that in practice, n >> m, meaning that what really matters to the computation time is the number of intervals. This should be a non-issue, as long as you use sane timeframes and your DB is indexed and maintained correctly. For example, using an interval size of one second for the time range in your question (9:00 - 11:30) gives only 9000 intervals to inspect. 9000 rows is paltry to an SQL server. I trust this to perform well far more than I trust dynamic SQL that emulates recursion.

If the interval size is 50 million times smaller than the time range, then yes, this will take a significant amount of time to run (notice I didn't say perform poorly), because you'll be running a query against 50 million rows. But is querying the maximum bookings for every millisecond in a twelve hour span (43.2 million ms) reasonable and necessary? There are only 604800 seconds in a week. Performing a query on a set that size, while not trivial, shouldn't give an SQL server any difficulty.
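The interval counts quoted above are easy to check. The helper below is only an illustrative sketch (`chunkCount` and the unit constants are made-up names); it divides the range length by the interval size, both in milliseconds.

```javascript
// Number of chunks needed to cover a range, both lengths in milliseconds.
function chunkCount(rangeMs, intervalMs) {
    return Math.ceil(rangeMs / intervalMs);
}

var SECOND = 1000;
var MINUTE = 60 * SECOND;
var HOUR = 60 * MINUTE;
var DAY = 24 * HOUR;

console.log(chunkCount(2.5 * HOUR, 15 * MINUTE)); // 10       (09:00 - 11:30 in 15 minute chunks)
console.log(chunkCount(2.5 * HOUR, SECOND));      // 9000     (09:00 - 11:30 in seconds)
console.log(chunkCount(7 * DAY, SECOND));         // 604800   (a week in seconds)
console.log(chunkCount(12 * HOUR, 1));            // 43200000 (twelve hours in milliseconds)
```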

What does your data look like? How fine of an inspection period do you need? If there's a two minute (or second, decasecond, millisecond...) interval where there are 105 bookings instead of 100 because someone entered an "unusual" end time, will that destroy the integrity of your report, or can that data be discarded as noise? I can't answer these questions, but some simple data and requirements analysis on your part can.

Related Solutions

Unit Testing – Testing Random/Non-Deterministic Algorithms

What you actually want to test here, I assume, is that given a specific set of results from the randomiser, the rest of your method performs correctly.

If that's what you're looking for then mock out the randomiser, to make it deterministic within the realms of the test.

I generally have mock objects for all kinds of non-deterministic or unpredictable (at the time of writing the test) data, including GUID generators and DateTime.Now.
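A minimal JS sketch of that idea, assuming the code under test accepts its random source as a parameter. `pickRandom` and `fixedRng` are hypothetical names invented for the example, not part of any framework.

```javascript
// Hypothetical function under test: picks one item using an injectable
// random source. rng must return a number in [0, 1), like Math.random.
function pickRandom(items, rng) {
    rng = rng || Math.random;
    return items[Math.floor(rng() * items.length)];
}

// Deterministic stand-in for the randomiser: replays a fixed list of values.
function fixedRng(values) {
    var i = 0;
    return function () {
        var v = values[i % values.length];
        i += 1;
        return v;
    };
}

var rng = fixedRng([0, 0.5, 0.99]);
var colours = ["red", "green", "blue"];
console.log(pickRandom(colours, rng)); // red
console.log(pickRandom(colours, rng)); // green
console.log(pickRandom(colours, rng)); // blue
```

With the stand-in in place, the test asserts on exact results instead of on statistical properties.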

Edit, from comments: You have to mock the PRNG (that term escaped me last night) at the lowest level possible, i.e. when it generates the array of bytes, not after you turn those into Int64s. Or even at both levels, so you can test that your conversion to an array of Int64 works as intended and then test separately that your conversion to an array of DateTimes works as intended. As Jonathon said, you could just do that by giving it a set seed, or you can give it the array of bytes to return.

I prefer the latter because it won't break if the framework implementation of a PRNG changes. However, one advantage to giving it the seed is that if you find a case in production that didn't work as intended, you only need to have logged one number to be able to replicate it, as opposed to the whole array.

All this said, you must remember that it's called a Pseudo Random Number Generator for a reason. There may be some bias even at that level.

Algorithm for Finding Overlapping Times

Put the start and end points (in time) of all the jobs into one array; this creates 2*N elements (one for each start, one for each end). Sort the array by event timestamp,

then iterate over the 2*N elements as follows:

counter = 0
ans = 0
for each element X
do
  if (X.type == start)
    counter++
  else
    counter--
  ans = max(ans, counter)
end

Complexity: O(n log n) for the initial build of the sorted structure + O(n) for the iteration through its elements.
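The same sweep can be adapted to the booking question by adding and subtracting each booking's qty instead of 1. The sketch below is one possible adaptation under stated assumptions: `sweepMax` is a made-up name, times are minutes since midnight, intervals are half-open, and end events are processed before start events at the same timestamp so that back-to-back bookings do not count as overlapping.

```javascript
// Maximum total qty booked at any instant inside range.
function sweepMax(bookings, range) {
    var events = [];
    bookings.forEach(function (b) {
        // Ignore bookings outside the range; clip the rest to it.
        if (b.start < range.end && b.end > range.start) {
            events.push({ time: Math.max(b.start, range.start), delta: b.qty });
            events.push({ time: Math.min(b.end, range.end), delta: -b.qty });
        }
    });
    // Sort by time; on ties the negative deltas (ends) come first.
    events.sort(function (a, b) {
        return a.time - b.time || a.delta - b.delta;
    });
    var counter = 0;
    var ans = 0;
    events.forEach(function (e) {
        counter += e.delta;
        ans = Math.max(ans, counter);
    });
    return ans;
}

// x7 09:00-10:30, then x2 and x10 10:30-12:30.
var first = [
    { start: 540, end: 630, qty: 7 },
    { start: 630, end: 750, qty: 2 },
    { start: 630, end: 750, qty: 10 }
];
// x7 09:00-11:00, x2 11:30-12:30, x10 10:30-12:30.
var second = [
    { start: 540, end: 660, qty: 7 },
    { start: 690, end: 750, qty: 2 },
    { start: 630, end: 750, qty: 10 }
];
var nineToHalfEleven = { start: 540, end: 690 };

console.log(sweepMax(first, nineToHalfEleven));  // 12
console.log(sweepMax(second, nineToHalfEleven)); // 17
```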

Edit: Giorgio suggested using an array, which is a better option. (The original suggestion was to use a red/black tree, but no task-removal capability seems to be needed, so maintaining order is trivial.)