How to count frequency of strings

data structuresstrings

I have 3 billion strings. I want to make a frequency map so I can discard strings that occur fewer than 100 times or more than 100,000 times. What kind of data structure(s) should I use? I'm thinking some kind of bloom filter.

Best Answer

If there are few enough unique strings to fit memory, just use a Dictionary<string, uint> where the key is the string and the value its count.

If the unique strings don't fit memory, you can use a bloom-filter like data-structure where you store a counter for each hash instead of a bit for each hash. Fill it in a first pass over the data. Then each string with sufficiently many occurrences will have the counter for all its associated hashes over the threshold (100 in your case). In the second pass, use the counting dictionary approach, but only on strings that aren't eliminated by the bloom-filter.

Related Solutions

Algorithms – How to Search for a Sub-string in an Array of Strings

As a pre-processing step turn your list into a trie.

a trie, also called digital tree or prefix tree, is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest...

The term trie comes from re-trie-val...

Should a String Constant Be Defined if Used Only Once?

Try this. The initial reflection is certainly expensive, but if you're going to use it many many times, which I think you will, this is most certainly a better solution what what you're proposing. I don't like using reflection, but I find myself using it when I don't like the alternative to reflection. I do think that this will save your team a lot of headache, but you must pass the name of the method (in lowercase).

In other words, rather than pass "name", you would pass "fullname" because the name of the get method is "getFullName()".

Map<String, Method> methodMapping = null;

public Object getNode(String name) {
    Map<String, Method> methods = getMethodMapping(contact.getClass());
    return methods.get(name).invoke(contact);
}

public Map<String, Method> getMethodMapping(Class<?> contact) {
    if(methodMapping == null) {
        Map<String, Method> mapping = new HashMap<String, Method>();
        Method[] methods = contact.getDeclaredMethods();
        for(Method method : methods) {
            if(method.getParameterTypes().length() == 0) {
                if(method.getName().startsWith("get")) {
                    mapping.put(method.getName().substring(3).toLower(), method);
                } else if (method.getName().startsWith("is"))) {
                    mapping.put(method.getName().substring(2).toLower(), method);
                }
            }
        }
        methodMapping = mapping;
    }
    return methodMapping;
}

If you need to access data contained within members of contact, you might consider building a wrapper class for contact which has all methods to access any information required. This would also be useful for guaranteeing that the names of the access fields will always remain the same (I.e. if wrapper class has getFullName() and you call with fullname, it will always work even if contact's getFullName() has been renamed -- it would cause compilation error before it would let you do that).

public class ContactWrapper {
    private Contact contact;

    public ContactWrapper(Contact contact) {
        this.contact = contact;
    }

    public String getFullName() {
        return contact.getFullName();
    }
    ...
}

This solution has saved me several times, namely when I wanted to have a single data representation to use in jsf datatables and when that data needed to be exported into a report using jasper (which doesn't handle complicated object accessors well in my experience).

Best Answer

Related Solutions

Algorithms – How to Search for a Sub-string in an Array of Strings

Should a String Constant Be Defined if Used Only Once?

Related Topic