Background
I am currently in the process of enduring grueling tech interviews for positions that use the .NET stack, some of which include silly questions like this one, and some of which are more valid. I recently came across an issue that may be valid, but I want to check with the community here to be sure.
When asked by an interviewer how I would count the frequency of words in a text document and rank the results, I answered that I would
- Use a stream object to read the text file into memory as a string.
- Split the string into an array on spaces while ignoring punctuation.
- Use LINQ against the array to `.GroupBy()` and `.Count()`, then `.OrderBy()` said count.
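As I understand my own description above, a minimal sketch of that answer might look like this (the separator set and the `Rank` method name are my own illustration, not the exact code from the interview):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordFrequency
{
    // Split the whole text on spaces (ignoring punctuation), group
    // identical words, count each group, then order by that count.
    public static List<(string Word, int Count)> Rank(string text) =>
        text.Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            .Select(group => (Word: group.Key, Count: group.Count()))
            .OrderByDescending(entry => entry.Count)
            .ToList();
}
```

In a real run, `text` would come from something like `File.ReadAllText(path)` — which is exactly the "whole file in memory" step at issue below.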
I got this answer wrong for two reasons:
- Streaming an entire text file into memory could be disastrous. What if it were an entire encyclopedia? Instead, I should stream one block at a time and begin building a hash table.
- LINQ is too expensive and requires too many processing cycles. I should have built a hash table instead and, for each iteration, only added a word to the hash table if it didn't already exist, then incremented its count.
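For concreteness, here is a sketch of what I take the interviewer's suggested approach to be: read the input incrementally (line by line here, rather than in fixed-size blocks) and keep only the word-to-count table in memory. The details are my own assumption, not the interviewer's code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StreamingWordCount
{
    // Read the input one line at a time instead of loading the whole
    // file; the only thing held in memory is the word -> count table.
    public static Dictionary<string, int> Count(TextReader reader)
    {
        var separators = new[] { ' ', '\t', '.', ',', ';', ':', '!', '?' };
        var counts = new Dictionary<string, int>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (var word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // Add the word if unseen (n defaults to 0), then increment.
                counts.TryGetValue(word, out int n);
                counts[word] = n + 1;
            }
        }
        return counts;
    }
}
```

For a file, this would be called with something like `StreamingWordCount.Count(new StreamReader(path))`, so memory use scales with the number of *distinct* words rather than the size of the document.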
The first reason seems, well, reasonable. But the second gives me more pause. I thought that one of the selling points of LINQ is that it simply abstracts away lower-level operations like hash tables but that, under the veil, it is still the same implementation.
Question
Aside from a few additional processing cycles to call any abstracted methods, does LINQ require significantly more processing cycles to accomplish a given data iteration task than a lower-level task (such as building a hash table) would?
Best Answer
I'd say the main weakness of this answer is less its use of Linq and more the specific operators chosen.
`GroupBy` takes each element and projects it to a key and a value, which go into a lookup. In other words, every word will add something to the lookup.

The naive implementation, `.GroupBy(e => e)`, will store a copy of every word in the source material, making the final lookup nearly as large as the original source material. Even if we project out the value with `.GroupBy(e => e, e => null)`, we're creating a large lookup of small values.

What we would want is an operator that preserves only the needed information: one copy of each word and the count of that word so far. For that, we can use `Aggregate`.
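The original answer's `Aggregate` code is not reproduced in this excerpt, so the following is my reconstruction of the idea it describes — folding the word sequence into a dictionary of counts:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AggregateCount
{
    // Fold the sequence into a single dictionary: only one copy of each
    // distinct word plus its running count is ever stored.
    public static Dictionary<string, int> Count(IEnumerable<string> words) =>
        words.Aggregate(
            new Dictionary<string, int>(),
            (counts, word) =>
            {
                counts.TryGetValue(word, out int n);
                counts[word] = n + 1;
                return counts;
            });
}
```

Note that this is still LINQ — the saving comes from accumulating counts as we go rather than materializing a lookup of every occurrence.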
From here, there are several ways we could attempt to make this faster: