Java – Complexity of ArrayList of LinkedHashSet

complexityhashingjava

I get input strings from the console like this:

while ((currentLine = bufferedReader.readLine()) != null ) {
    StringTokenizer string = new StringTokenizer(currentLine, " ");
    while (string.hasMoreTokens()) {
        // Create a new LinkedHashSet for every token and then add it to the ArrayList.
        LinkedHashSet<String> linkedHashSet = new LinkedHashSet<String>();
        linkedHashSet.add(string.nextToken());
        setOfStrings.add(linkedHashSet);
    }
}

I'm always getting different strings from input, never the same. After finishing filling in the data structures I have this situation:

  • An ArrayList<LinkedHashSet<String>> which contains one LinkedHashSet for each string split.
  • Inside each LinkedHashSet I have a string that is different from any other string present in the other LinkedHashSets – in other words, its unique. For example, I can't have this inside the ArrayList:

    Set x : [foo]
    ...
    Set y : [foo] 
    

After doing that, I call the function below many times
to make different merges.

public void mergeSets(ArrayList<String> operations, ArrayList<LinkedHashSet<String>> setOfStrings) {
    String toMerge = operations.get(1);
    String fromMerge = operations.get(2);
    boolean enteredFirstToMerge = false;
    boolean enteredFirstFromMerge = false;
    // Temporary LinkedHashSet reference used to merge two sets.
    LinkedHashSet<String> subSetToMerge = null;
    LinkedHashSet<String> subSetFromMerge = null;
    for (Iterator<LinkedHashSet<String>> iterator = setOfStrings.iterator();
            iterator.hasNext(); ) {
        LinkedHashSet<String> subSet = iterator.next();
        if (subSet.contains(toMerge) && subSet.contains(fromMerge))
            break;
        else {
            if (subSet.contains(toMerge) && !enteredFirstToMerge) {
                enteredFirstToMerge = true;
                subSetToMerge = subSet;
                iterator.remove();
            } else if (subSet.contains(fromMerge) && !enteredFirstFromMerge) {
                enteredFirstFromMerge = true;
                subSetFromMerge = subSet;
            }
        }
        if (enteredFirstFromMerge && enteredFirstToMerge)
            break;
        }
        if (enteredFirstFromMerge && enteredFirstToMerge) {
            subSetFromMerge.addAll(subSetToMerge);
        }
    }

Explanation:

For example, if I have as operation merge foo bar, I have to do these steps:

  • First of all, I have to find where foo and bar are located:

    • Inside setOfStrings, I can have this situation:

          position x : [bar, tree, hotel]
          ...
          position y : [foo, lemon, coffee] 
      

When I find them, I have to combine the set which contains foo with the set that contains bar this way:

            position x : {bar tree hotel foo lemon coffee}
            ...
            position y : {} -> deleted from the arrayList

This function takes as parameters an arrayList of operations and an arrayList<LinkedHashSet<String>>:

For the ArrayList of operations, I always get a specific position:

  • operations.get(1) refers to the set to merge (foo in this example)
  • operations.get(2) refers to the set where to add the foo set (bar in this example)

With this for loop, I iterate the ArrayList to search the to sets,for (Iterator<LinkedHashSet<String>> iterator = setOfStrings.iterator(); iterator.hasNext(); )

This if statement checks if the iterator is in the specific set:

if (subSet.contains(toMerge) && !enteredFirstToMerge) {
    enteredFirstToMerge = true;
    subSetToMerge = subSet;
    iterator.remove();
} else if (subSet.contains(fromMerge) && !enteredFirstFromMerge) {
    enteredFirstFromMerge = true;
    subSetFromMerge = subSet;
}

My question is: Could I have collisions with this type of algorithm that I implemented?

If not the time complexity is only O(n) -> the size of the arrayList.

Best Answer

Your main question is whether you could have a collision, but it's first important to determine where you might have a collision. An ArrayList does not have collisions because it is an ordered list on top of an array. Your concern there would be more about how often it has to extend the capacity of the underlying array. You would feel this when you are reading the input only, though.

Where you might have collisions is in the LinkedHashSets. Since hash collisions are a function of the capacity of the underlying hash store, the hashing algorithm, and the data itself, there is always a possibility for a collisions, and there is no way to know if you'll have one except to run all of your data through. In the end, since these are all different strings, and the hashing algorithm for strings built into Java is pretty effective, the remaining factor to determine how many collisions you could potentially have is the capacity of the underlying hash store.

A default LinkedHashSet has an initial capacity of 16 and a load factor of 0.75. This means that once there are 12 elements in the LinkedHashSet, it will double its capacity to accommodate more elements. If we can assume that hashes are evenly distributed, a larger capacity means a lower probability of a collision. Resizing a LinkedHashSet is actually the most expensive activity. In order to resize the underlying hash store, it essentially must iterate through the current elements, rehash them, and re-insert them into the larger hash store. Your best strategy to avoid collisions and rehashing is to allocate large LinkedHashSets when you instantiate them. What "large" is depends on how much data you are inputting and how much of it will be combined. If you know the size and general shape of your data before hand, you could try to estimate a proper initial size large enough to accommodate the largest set (needed size / 0.75 to account for the load factor). This essentially becomes the old time vs. space trade-off, as larger initial sizes means more memory consumed.


Other general suggestions

StringTokenizer is deprecated, and string.split(" ") is generally recommended in its place. However, since you are getting your input from the console, I would suggest disposing of the buffered reader completely and wrapping System.in in a Scanner. Scanner has a next() method that fetches the next token, implicitly tokenizing the input for you with less effort.

Also, while you are using ArrayList and LinkedHashSet for their performance characteristics, it is probably still better to rely on their interfaces (List and Set respectively) than those specific implementations.

Third, instead of writing out all of the code to get an iterator and then looping over it, you can use the for-each construct to loop over any Iterable.

Fourth, you don't need boolean flags when you can just as easily get the same information from a null check. We can eliminate two variables right off the bat.

Here's a slightly rewritten version incorporating those suggestions:

Main.java

import java.io.*;
import java.util.*;

public class Main
{
    public static void main(String[] args) throws IOException {
        Scanner input = new Scanner(System.in);
        List<Set<String>> setOfStrings = new ArrayList<>();

        while (input.hasNext()) {
            Set<String> tokenSet = new LinkedHashSet<>();
            tokenSet.add(input.next());
            setOfStrings.add(tokenSet);
        }

        SetMerger sm = new SetMerger();
        sm.mergeSets("foo", "bar", setOfStrings);
        sm.mergeSets("bar", "baz", setOfStrings);
        System.out.println(setOfStrings.toString());
    }
}

SetMerger.java

import java.util.*;

public class SetMerger
{
    public void mergeSets(String toMerge, String fromMerge, List<Set<String>> stringSets) {
        Set<String> setToMerge = null, targetSet = null;

        for (Set<String> subset : stringSets) {
            if(subset.contains(toMerge)) {
                if(subset.contains(fromMerge)) {
                    return;
                }

                setToMerge = subset;
                break;
            }
        }

        for (Set<String> subset : stringSets) {
            if(subset.contains(fromMerge)) {
                targetSet = subset;
                break;
            }
        }

        if(setToMerge != null && targetSet != null) {
            targetSet.addAll(setToMerge);
            stringSets.remove(setToMerge);
        }
    }
}