Algorithms – Grouping Numbers to Minimize Group-Means

algorithmsclustertime

I need to find a way or an algorithm that groups members of a given data set (of positive integers) so that the difference between group means is minimized (not maximized, as usual).

There are two constraints:

  1. The number of groups should not exceed log(N) base 2. N is the input array size. Let us assume that N = 16, always.
  2. The size of the group should be at least log(N) base 2. In the below example, the size of the group should be at least 4.

By searching on the WEB, I have found the following greedy algorithm. Please also see below example.

  1. sort the numbers in the descending order
  2. take the first K elements and put them into different groups. Here, K is the number of groups.
  3. for the next (N – K) elements, put them in the group with the lowest sum. repeat this until completion of all the numbers.

Can we get a better algorithm in terms of time complexity? Guidance toward a better solution to this problem is appreciated.

Example:

for input array = (11, 11, 14, 16, 17, 18, 18, 19, 20, 21, 22, 25, 25, 26, 28, 31)

The solution:  (where sd: standard deviation, cv: coefficient of variation)

    group           mean    sd  cv
(31, 20, 17, 11)    19.8    8.4 0.4 
(28, 21, 18, 11)    19.5    7.1 0.4
(26, 21, 19, 14)    20.0    5.0 0.3
(25, 22, 18, 16)    20.3    4.0 0.2

Best Answer

Intuitively, even though your constaints are relaxed, I think your problem is an instance of the bin packing problem, which is NP-hard. Perhaps you should ask the CS Theory group whether this is an instance of BPP/generalized BPP.

Your greedy algorithm works in linearithmic complexity (sorting in O(n log n), binning in O(n log log n), assuming using heaps, because your number of bins is ~ log n), but it is not optimal (easy to find sequences which trick it).

You can do a best-fit greedy in O(n log log n) - find the mean of the numbers in linear time (without sorting), and then for each number try to fit it in a bin such that current average stays as close as possible to the total average - this should be a log-logarithmic operation with respect to n. Whether this is an appropriate solution for your problem depends on how far from optimal you can afford to be.

However, note that you can't go better than this O(n log log n) - because whatever clever scheme you use, you will need to pick each element up, and bin it.

Related Topic