Java – Can we solve this Sock Merchant problem in less complexity

algorithmjava

I have solved the hackerrank Sock Merchant problem But I want to reduce the complexity of the code(I am not sure that it is possible or not).

John works at a clothing store. He has a large pile of socks that he must pair by color for sale. Given an array of integers representing the color of each sock, determine how many pairs of socks with matching colors there are.

For example, there are n=7 socks with colors ar= [1,2,1,2,1,3,2]. There is one pair of color 1 and one of color 2. There are three odd socks left, one of each color. The number of pairs is 2.

Function Description

Complete the sockMerchant function in the editor below. It must return an integer representing the number of matching pairs of socks that are available.

sockMerchant has the following parameter(s):

n: the number of socks in the pile

ar: the colors of each sock

Input Format

The first line contains an integer n, the number of socks represented in ar.
The second line contains n space-separated integers describing the colors ar[i] of the socks in the pile.

Constraints

1 <= n <= 100

1 <= ar[i] <= 100 where 0 <= i < n

Output Format

Return the total number of matching pairs of socks that John can sell.

Sample Input

9
10 20 20 10 10 30 50 10 20

Sample Output

My solutions :

package com.hackerrank.test;

public class Solution {
    public static void main(String[] args) {
        //Initialize array
        int[] arr = new int[]{10, 20, 20, 10, 10, 30, 50, 10, 20};
        //Array fr will store frequencies of element


        System.out.println("---------------------------------------");
        System.out.println(" sockMerchant output " + sockMerchant(9, arr));
        System.out.println("---------------------------------------");

    }

    static int sockMerchant(int n, int[] ar) {
        int pairs = 0;
        int frequencyArray[] = new int[ar.length];
        int frequencyTemp = -1;
        for (int i = 0; i < ar.length; i++) {
            int count = 1;
            for (int j = i + 1; j < ar.length; j++) {
                if (ar[i] == ar[j]) {
                    count++;
                    frequencyArray[j] = frequencyTemp;
                }
            }
            if (frequencyArray[i] != frequencyTemp) {
                frequencyArray[i] = count;
            }
        }

        for (int i = 0; i < frequencyArray.length; i++) {
            if (frequencyArray[i] != frequencyTemp) {
                int divide = frequencyArray[i] / 2;
                pairs += divide;
            }
        }
        return pairs;
    }
}

And the output is :

    ---------------------------------------
    sockMerchant frequency 3
    ---------------------------------------

Best Answer

You can solve this in a single pass (O(n)) using a HashSet, which has O(1) put and lookup time. Each element is already in the set, in which case it gets removed and the pair counter is incremented, or it's not, in which case you add it:

int[] arr = new int[]{10, 20, 20, 10, 10, 30, 50, 10, 20};

HashSet<Integer> unmatched = new HashSet<>();
int pairs = 0;
for(int i = 0; i < arr.length; i++) {
    if(!unmatched.add(arr[i])) {
        unmatched.remove(arr[i]);
        pairs++;
    }
}

How do you implement `buildHeap` so it runs in O(n) time?

Often, answers to these questions focus on the difference between siftUp and siftDown. Making the correct choice between siftUp and siftDown is critical to get O(n) performance for buildHeap, but does nothing to help one understand the difference between buildHeap and heapSort in general. Indeed, proper implementations of both buildHeap and heapSort will only use siftDown. The siftUp operation is only needed to perform inserts into an existing heap, so it would be used to implement a priority queue using a binary heap, for example.

I've written this to describe how a max heap works. This is the type of heap typically used for heap sort or for a priority queue where higher values indicate higher priority. A min heap is also useful; for example, when retrieving items with integer keys in ascending order or strings in alphabetical order. The principles are exactly the same; simply switch the sort order.

The heap property specifies that each node in a binary heap must be at least as large as both of its children. In particular, this implies that the largest item in the heap is at the root. Sifting down and sifting up are essentially the same operation in opposite directions: move an offending node until it satisfies the heap property:

siftDown swaps a node that is too small with its largest child (thereby moving it down) until it is at least as large as both nodes below it.
siftUp swaps a node that is too large with its parent (thereby moving it up) until it is no larger than the node above it.

The number of operations required for siftDown and siftUp is proportional to the distance the node may have to move. For siftDown, it is the distance to the bottom of the tree, so siftDown is expensive for nodes at the top of the tree. With siftUp, the work is proportional to the distance to the top of the tree, so siftUp is expensive for nodes at the bottom of the tree. Although both operations are O(log n) in the worst case, in a heap, only one node is at the top whereas half the nodes lie in the bottom layer. So it shouldn't be too surprising that if we have to apply an operation to every node, we would prefer siftDown over siftUp.

The buildHeap function takes an array of unsorted items and moves them until they all satisfy the heap property, thereby producing a valid heap. There are two approaches one might take for buildHeap using the siftUp and siftDown operations we've described.

Start at the top of the heap (the beginning of the array) and call siftUp on each item. At each step, the previously sifted items (the items before the current item in the array) form a valid heap, and sifting the next item up places it into a valid position in the heap. After sifting up each node, all items satisfy the heap property.
Or, go in the opposite direction: start at the end of the array and move backwards towards the front. At each iteration, you sift an item down until it is in the correct location.

Which implementation for `buildHeap` is more efficient?

Both of these solutions will produce a valid heap. Unsurprisingly, the more efficient one is the second operation that uses siftDown.

Let h = log n represent the height of the heap. The work required for the siftDown approach is given by the sum

(0 * n/2) + (1 * n/4) + (2 * n/8) + ... + (h * 1).

Each term in the sum has the maximum distance a node at the given height will have to move (zero for the bottom layer, h for the root) multiplied by the number of nodes at that height. In contrast, the sum for calling siftUp on each node is

(h * n/2) + ((h-1) * n/4) + ((h-2)*n/8) + ... + (0 * 1).

It should be clear that the second sum is larger. The first term alone is hn/2 = 1/2 n log n, so this approach has complexity at best O(n log n).

How do we prove the sum for the `siftDown` approach is indeed O(n)?

One method (there are other analyses that also work) is to turn the finite sum into an infinite series and then use Taylor series. We may ignore the first term, which is zero:

If you aren't sure why each of those steps works, here is a justification for the process in words:

The terms are all positive, so the finite sum must be smaller than the infinite sum.
The series is equal to a power series evaluated at x=1/2.
That power series is equal to (a constant times) the derivative of the Taylor series for f(x)=1/(1-x).
x=1/2 is within the interval of convergence of that Taylor series.
Therefore, we can replace the Taylor series with 1/(1-x), differentiate, and evaluate to find the value of the infinite series.

Since the infinite sum is exactly n, we conclude that the finite sum is no larger, and is therefore, O(n).

Why does heap sort require O(n log n) time?

If it is possible to run buildHeap in linear time, why does heap sort require O(n log n) time? Well, heap sort consists of two stages. First, we call buildHeap on the array, which requires O(n) time if implemented optimally. The next stage is to repeatedly delete the largest item in the heap and put it at the end of the array. Because we delete an item from the heap, there is always an open spot just after the end of the heap where we can store the item. So heap sort achieves a sorted order by successively removing the next largest item and putting it into the array starting at the last position and moving towards the front. It is the complexity of this last part that dominates in heap sort. The loop looks likes this:

for (i = n - 1; i > 0; i--) {
    arr[i] = deleteMax();
}

Clearly, the loop runs O(n) times (n - 1 to be precise, the last item is already in place). The complexity of deleteMax for a heap is O(log n). It is typically implemented by removing the root (the largest item left in the heap) and replacing it with the last item in the heap, which is a leaf, and therefore one of the smallest items. This new root will almost certainly violate the heap property, so you have to call siftDown until you move it back into an acceptable position. This also has the effect of moving the next largest item up to the root. Notice that, in contrast to buildHeap where for most of the nodes we are calling siftDown from the bottom of the tree, we are now calling siftDown from the top of the tree on each iteration! Although the tree is shrinking, it doesn't shrink fast enough: The height of the tree stays constant until you have removed the first half of the nodes (when you clear out the bottom layer completely). Then for the next quarter, the height is h - 1. So the total work for this second stage is

h*n/2 + (h-1)*n/4 + ... + 0 * 1.

Notice the switch: now the zero work case corresponds to a single node and the h work case corresponds to half the nodes. This sum is O(n log n) just like the inefficient version of buildHeap that is implemented using siftUp. But in this case, we have no choice since we are trying to sort and we require the next largest item be removed next.

In summary, the work for heap sort is the sum of the two stages: O(n) time for buildHeap and O(n log n) to remove each node in order, so the complexity is O(n log n). You can prove (using some ideas from information theory) that for a comparison-based sort, O(n log n) is the best you could hope for anyway, so there's no reason to be disappointed by this or expect heap sort to achieve the O(n) time bound that buildHeap does.

How to pair socks from a pile efficiently

Sorting solutions have been proposed, but sorting is a little too much: We don't need order; we just need equality groups.

So hashing would be enough (and faster).

For each color of socks, form a pile. Iterate over all socks in your input basket and distribute them onto the color piles.
Iterate over each pile and distribute it by some other metric (e.g. pattern) into the second set of piles
Recursively apply this scheme until you have distributed all socks onto very small piles that you can visually process immediately

This kind of recursive hash partitioning is actually being done by SQL Server when it needs to hash join or hash aggregate over huge data sets. It distributes its build input stream into many partitions which are independent. This scheme scales to arbitrary amounts of data and multiple CPUs linearly.

You don't need recursive partitioning if you can find a distribution key (hash key) that provides enough buckets that each bucket is small enough to be processed very quickly. Unfortunately, I don't think socks have such a property.

If each sock had an integer called "PairID" one could easily distribute them into 10 buckets according to PairID % 10 (the last digit).

The best real-world partitioning I can think of is creating a rectangle of piles: one dimension is color, the other is the pattern. Why a rectangle? Because we need O(1) random-access to piles. (A 3D cuboid would also work, but that is not very practical.)

Update:

What about parallelism? Can multiple humans match the socks faster?

The simplest parallelization strategy is to have multiple workers take from the input basket and put the socks onto the piles. This only scales up so much - imagine 100 people fighting over 10 piles. The synchronization costs (manifesting themselves as hand-collisions and human communication) destroy efficiency and speed-up (see the Universal Scalability Law!). Is this prone to deadlocks? No, because each worker only needs to access one pile at a time. With just one "lock" there cannot be a deadlock. Livelocks might be possible depending on how the humans coordinate access to piles. They might just use random backoff like network cards do that on a physical level to determine what card can exclusively access the network wire. If it works for NICs, it should work for humans as well.
It scales nearly indefinitely if each worker has its own set of piles. Workers can then take big chunks of socks from the input basket (very little contention as they are doing it rarely) and they do not need to synchronise when distributing the socks at all (because they have thread-local piles). At the end, all workers need to union their pile-sets. I believe that can be done in O(log (worker count * piles per worker)) if the workers form an aggregation tree.

What about the element distinctness problem? As the article states, the element distinctness problem can be solved in O(N). This is the same for the socks problem (also O(N), if you need only one distribution step (I proposed multiple steps only because humans are bad at calculations - one step is enough if you distribute on md5(color, length, pattern, ...), i.e. a perfect hash of all attributes)).

Clearly, one cannot go faster than O(N), so we have reached the optimal lower bound.

Although the outputs are not exactly the same (in one case, just a boolean. In the other case, the pairs of socks), the asymptotic complexities are the same.

Best Answer

Related Solutions

How to building a heap be O(n) time complexity

How do you implement buildHeap so it runs in O(n) time?

Which implementation for buildHeap is more efficient?

How do we prove the sum for the siftDown approach is indeed O(n)?

Why does heap sort require O(n log n) time?

How to pair socks from a pile efficiently

Related Topic

How do you implement `buildHeap` so it runs in O(n) time?

Which implementation for `buildHeap` is more efficient?

How do we prove the sum for the `siftDown` approach is indeed O(n)?