How to count frequency of strings

data structuresstrings

I have 3 billion strings. I want to make a frequency map so I can discard strings that occur fewer than 100 times or more than 100,000 times. What kind of data structure(s) should I use? I'm thinking some kind of bloom filter.

Best Answer

If there are few enough unique strings to fit memory, just use a Dictionary<string, uint> where the key is the string and the value its count.

If the unique strings don't fit memory, you can use a bloom-filter like data-structure where you store a counter for each hash instead of a bit for each hash. Fill it in a first pass over the data. Then each string with sufficiently many occurrences will have the counter for all its associated hashes over the threshold (100 in your case). In the second pass, use the counting dictionary approach, but only on strings that aren't eliminated by the bloom-filter.

Related Topic