LINQ Performance – Does LINQ Require More Processing Cycles and Memory?

algorithms · linq · .net · performance

Background

I am currently enduring a round of grueling tech interviews for positions that use the .NET stack. Some of them include silly questions like this one, and some include questions that are more valid. I recently came across an issue that may be valid, but I want to check with the community here to be sure.

When asked by an interviewer how I would count the frequency of words in a text document and rank the results, I answered that I would

  1. Use a stream object to put the text file in memory as a string.
  2. Split the string into an array on spaces while ignoring punctuation.
  3. Use LINQ against the array to .GroupBy() and .Count(), then OrderBy() said count.
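In code, my answer amounted to something roughly like this (the file name is made up, and the punctuation handling I mentioned is omitted for brevity):

using System;
using System.IO;
using System.Linq;

string text = File.ReadAllText("input.txt");  // 1. pull the whole file into memory

var ranked = text
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)       // 2. split on spaces
    .GroupBy(word => word)                                             // 3. group identical words...
    .Select(group => new { Word = group.Key, Count = group.Count() })  //    ...count each group...
    .OrderByDescending(item => item.Count);                            //    ...and rank by that count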

I got this answer wrong for two reasons:

  1. Streaming an entire text file into memory could be disastrous. What if it were an entire encyclopedia? Instead I should stream one block at a time and begin building a hash table.
  2. LINQ is too expensive and requires too many processing cycles. I should have built a hash table instead and, on each iteration, only added a word to the hash table if it didn't already exist, then incremented its count.
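As I understood it, the suggested alternative would look something like the following sketch (my own interpretation, not the interviewer's code; the file name is again made up):

using System;
using System.Collections.Generic;
using System.IO;

var counts = new Dictionary<string, int>();

// Stream the file a line at a time instead of loading it all at once.
foreach (string line in File.ReadLines("input.txt"))
{
    foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        int current;
        counts.TryGetValue(word, out current);  // current stays 0 if the word hasn't been seen yet
        counts[word] = current + 1;
    }
}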

The first reason seems, well, reasonable. But the second gives me more pause. I thought that one of the selling points of LINQ is that it simply abstracts away lower-level operations like hash tables but that, under the hood, it is still the same implementation.

Question

Aside from a few additional processing cycles to call any abstracted methods, does LINQ require significantly more processing cycles to accomplish a given data-iteration task than a lower-level implementation (such as building a hash table) would?

Best Answer

I'd say the main weakness of this answer is less its use of Linq and more the specific operators chosen. GroupBy takes each element and projects it to a key and a value which go into a lookup. In other words, every word will add something to the lookup.

The naive implementation .GroupBy(e => e) will store a copy of every word in the source material, making the final lookup nearly as large as the original source material. Even if we project out the value with .GroupBy(e => e, e => (string)null), we're still creating a large lookup of small values.
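To make that concrete with a toy input (purely illustrative):

using System.Linq;

var words = new[] { "the", "quick", "the", "fox", "the" };
var groups = words.GroupBy(w => w);

// Conceptually, the lookup behind those groups now holds:
//   "the"   -> { "the", "the", "the" }
//   "quick" -> { "quick" }
//   "fox"   -> { "fox" }
// i.e. roughly as many stored elements as the input itself,
// even though all we ultimately need is a count per key.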

What we would want is an operator that preserves only the needed information, which is one copy of each word and the count of that word so far. For that, we can use Aggregate:

words.Aggregate(new Dictionary<string, int>(), (counts, word) =>
{
    int currentCount;
    counts.TryGetValue(word, out currentCount);  // currentCount stays 0 for a word we haven't seen yet
    counts[word] = currentCount + 1;
    return counts;
});
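If the dictionary returned by that Aggregate call is captured in a variable, say counts (a name introduced just for this snippet), ranking it is then a single ordering step:

using System.Linq;

// Assumes the Aggregate result above was assigned to a Dictionary<string, int> named counts.
var ranked = counts
    .OrderByDescending(pair => pair.Value)
    .Select(pair => new { Word = pair.Key, Count = pair.Value });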

From here, there are several ways we could attempt to make this faster:

  1. Instead of creating many strings while splitting, we could pass around structs that reference the original string and the segment that contains the word, and only copy the segment out when it turns out to be a unique key (sketched after this list).
  2. Use Parallel Linq to aggregate across several cores, then combine the results (also sketched below). This is trivial compared to the legwork required to do this by hand.
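A minimal sketch of the first idea, assuming we track the source string plus an offset and length and compare segments by their characters. All type and member names here are illustrative, and the span-based string.GetHashCode overload requires .NET Core 3.0 or later:

using System;
using System.Collections.Generic;

// Hypothetical segment type: points into the original text instead of copying it.
readonly struct WordSegment
{
    public WordSegment(string source, int start, int length)
    {
        Source = source;
        Start = start;
        Length = length;
    }

    public string Source { get; }
    public int Start { get; }
    public int Length { get; }

    public ReadOnlySpan<char> Span => Source.AsSpan(Start, Length);

    // Copy the characters out only when we actually need a standalone string.
    public string Materialize() => Source.Substring(Start, Length);
}

// Compares segments by their characters, not by reference or position.
sealed class WordSegmentComparer : IEqualityComparer<WordSegment>
{
    public bool Equals(WordSegment x, WordSegment y) => x.Span.SequenceEqual(y.Span);

    public int GetHashCode(WordSegment segment) => string.GetHashCode(segment.Span);
}

// Counts could then be accumulated in a Dictionary<WordSegment, int> built with the comparer,
// materializing keys into strings only when reporting the final results.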
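And a rough sketch of the second idea, using the PLINQ Aggregate overload that takes a per-partition seed factory and a combine step; words is assumed to be the sequence of split words:

using System.Collections.Generic;
using System.Linq;

var counts = words
    .AsParallel()
    .Aggregate(
        // one accumulator per partition/core
        () => new Dictionary<string, int>(),
        // fold each word into the partition-local dictionary
        (local, word) =>
        {
            int current;
            local.TryGetValue(word, out current);
            local[word] = current + 1;
            return local;
        },
        // merge the partition-local dictionaries
        (merged, local) =>
        {
            foreach (var pair in local)
            {
                int current;
                merged.TryGetValue(pair.Key, out current);
                merged[pair.Key] = current + pair.Value;
            }
            return merged;
        },
        // final projection (nothing extra needed here)
        result => result);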