Advantages of Non-Caching LINQ Implementations

Tags: frameworks, implementations, linq, .net

This is a known pitfall for people who are getting their feet wet using LINQ:

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        IEnumerable<Record> originalCollection = GenerateRecords(new[] {"Jesse"});
        var newCollection = new List<Record>(originalCollection);

        Console.WriteLine(ContainTheSameSingleObject(originalCollection, newCollection));
    }

    private static IEnumerable<Record> GenerateRecords(string[] listOfNames)
    {
        return listOfNames.Select(x => new Record(Guid.NewGuid(), x));
    }

    private static bool ContainTheSameSingleObject(IEnumerable<Record>
            originalCollection, List<Record> newCollection)
    {
        return originalCollection.Count() == 1 && newCollection.Count() == 1 &&
                originalCollection.Single().Id == newCollection.Single().Id;
    }

    private class Record
    {
        public Guid Id { get; }
        public string SomeValue { get; }

        public Record(Guid id, string someValue)
        {
            Id = id;
            SomeValue = someValue;
        }
    }
}

This prints "False" because the sequence returned by Select is re-evaluated every time it is enumerated, so each pass over the collection creates brand-new Record objects with fresh Guids. To fix this, a simple call to ToList could be added at the end of GenerateRecords.
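A minimal sketch of that fix: materializing the sequence once with ToList means the Select projection runs exactly once, and every later enumeration sees the same Record instances, so the program prints "True".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    private class Record
    {
        public Guid Id { get; }
        public string SomeValue { get; }

        public Record(Guid id, string someValue)
        {
            Id = id;
            SomeValue = someValue;
        }
    }

    // ToList materializes the sequence: the Select lambda runs once per
    // name, and subsequent enumerations return the same cached objects.
    private static IEnumerable<Record> GenerateRecords(string[] listOfNames)
    {
        return listOfNames.Select(x => new Record(Guid.NewGuid(), x)).ToList();
    }

    public static void Main()
    {
        IEnumerable<Record> originalCollection = GenerateRecords(new[] { "Jesse" });
        var newCollection = new List<Record>(originalCollection);

        // Both collections now hold the same single Record instance.
        Console.WriteLine(originalCollection.Single().Id == newCollection.Single().Id); // True
    }
}
```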

What advantage did Microsoft hope to gain by implementing it this way?

Why wouldn't the implementation simply cache the results in an internal array? Deferred execution is part of what's happening here, but deferred execution could still be implemented without this re-evaluation behavior.

Once a given member of a collection returned by LINQ has been evaluated, what advantage is provided by not keeping an internal reference/copy, but instead recalculating the same result, as a default behavior?

In situations where the logic genuinely needs the same member of a collection recalculated over and over, it seems that could be requested through an optional parameter, with the default behavior doing otherwise. In addition, the speed advantage gained by deferred execution is ultimately undercut by the time it takes to continually recalculate the same results. Finally, this is a confusing stumbling block for those who are new to LINQ, and it can lead to subtle bugs in anyone's program.

What advantage is there to this, and why did Microsoft make this seemingly very deliberate decision?

Best Answer

What advantage was gained by implementing LINQ in a way that does not cache the results?

Caching the results would simply not work for everybody. As long as you have tiny amounts of data, great. Good for you. But what if your data is larger than your RAM?

It has nothing to do with LINQ, but with the IEnumerable<T> interface in general.

It is the difference between File.ReadAllLines and File.ReadLines. One will read the whole file into RAM, and the other will give it to you line by line, so you can work with large files (as long as they have line-breaks).
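A small sketch of that difference, using a temporary file just for illustration: ReadLines returns a lazy IEnumerable&lt;string&gt;, so composing it with Take(3) reads only the start of the file, while ReadAllLines would have loaded every line into a string[] up front.

```csharp
using System;
using System.IO;
using System.Linq;

class StreamingDemo
{
    static void Main()
    {
        // Temporary file used purely for demonstration.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, Enumerable.Range(1, 1_000_000).Select(i => $"line {i}"));

        // File.ReadLines is lazy: lines are pulled from disk only as they
        // are enumerated, so Take(3) never touches the rest of the file.
        var firstThree = File.ReadLines(path).Take(3).ToList();

        Console.WriteLine(firstThree.Count); // 3
        Console.WriteLine(firstThree[0]);    // line 1

        File.Delete(path);
    }
}
```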

You can easily cache everything you want by materializing your sequence, calling either .ToList() or .ToArray() on it. But those of us who do not want to cache it get the chance not to.

And on a related note: how do you cache the following?

IEnumerable<int> AllTheZeroes()
{
    while(true) yield return 0;
}

You cannot. That's why IEnumerable<T> exists as it does.