Algorithms – Finding the Largest Subarray in a 2D Array with N Different Numbers

algorithms

Let's say I have a 2d array 100×100 in size, each cell in that array has a number from 1 to 50 randomly.

How do I find the biggest subarray in a rectangle size in that array that has only n different numbers?

Small example:

array: 1 2 3 3 3 2 2
       7 4 3 3 4 9 8
       1 2 3 3 3 4 2
       7 6 1 9 9 4 2
In this case the biggest sub rectangle with only 1 number allowed would be 
       3 3
       3 3
With 2 numbers allowed it would be
       3 3 3
       3 3 4
       3 3 3

Does something for this exist?

Best Answer

If the array has a size of n x m, then there are roughly n^2 m^2 / 4 subarrays. If there are k different values, then we can represent each as a bit array with k bits. This takes for example k' := ceil (k / 64) 64 bit words.

It's quite easy to calculate each of the sets of values for n^2 m^2 / 4 subarray in a total of n^2 m^2 k' / 4 steps. The number of bits can be counted in log2 (k) steps, so we have a total of n^2 m^2 k' log2 (k) / 4 operations. In this case, 10,000 * 10,000 * 1 * 6 / 4 = 150 million steps. So no big deal. It gets more interesting if the values are larger.

However, this brute force search would be quite inefficient in this case (where each cell contains a random number). We can reasonably easy calculate the probabibility that a rectangle of n numbers contains k or fewer different numbers, and that probability will go down rapidly if n gets larger than k. So we would want to avoid checking really large rectangles. Here's how I would suggest to proceed, for an array of n rows and m columns, looking for the largest rectangle with at most k different numbers:

We handle trivial cases: k ≤ 0 means no solution, k ≥ number of different numbers means (1:n, 1:m) is the solution. Then find the largest trivial solution, which is just the largest rectangle with an area ≤ k. (For example, if the array is 5x5 and k = 7, we can trivially find a rectangle of size 6 but not of size 7). Let the size of the rectangle be s ≤ k. Let maxc (r) = "largest c where we haven't proved that a rectangle with r rows and c columns and less than k different numbers cannot exist".

Given a record s, we then need to find a rectangle which is larger or prove that it doesn't exist. We find a pair (r, c) with the smallest product rc such that c ≤ maxc (r) and rc > s. If such a pair doesn't exist then we have the optimal solution. Otherwise, we look for a rectangle of size r x c with k or fewer different numbers. If none is found, then we set maxc (i) = min (maxc (i), c - 1) for all i ≥ r (there cannot be a rectangle with r rows or more, and with c-1 columns or more) and try again. If a rectangle is found, then we record it, increase s, and also try again.