C# – How to Import Large CSV Files

c#, csv, import, linq

I've been tasked with querying two large CSV files, each roughly 1 GB in size. The files contain related data, so file one may contain a list of order numbers, order dates, etc., and the other file may contain the order lines, stock codes, quantities, etc.

Other Restrictions:

  • No Access to a database – so I can't just bulk insert and query
  • Must be able to add new queries at a later point

The kind of query I need to write is "sum all the lines grouped by stock code" or "how many orders were there between two dates".
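
Concretely, these are the kinds of queries I mean, sketched here against a hypothetical OrderLine model (the class, property names, and method names are just illustrative, not taken from the actual files):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model -- the real files have more columns.
public class OrderLine
{
    public int OrderNo { get; set; }
    public DateTime OrderDate { get; set; }
    public string StockCode { get; set; }
    public int Quantity { get; set; }
}

public static class OrderQueries
{
    // "Sum all the lines grouped by stock code."
    // Accumulating into a dictionary keeps memory proportional to the number of
    // distinct stock codes, even if the source enumerable streams a 1 GB file
    // (LINQ's GroupBy would buffer the whole source before yielding any groups).
    public static Dictionary<string, int> SumByStockCode(IEnumerable<OrderLine> lines)
    {
        var totals = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            int current;
            totals.TryGetValue(line.StockCode, out current);
            totals[line.StockCode] = current + line.Quantity;
        }
        return totals;
    }

    // "How many orders were there between two dates" (inclusive),
    // counting distinct order numbers rather than order lines.
    public static int CountOrdersBetween(IEnumerable<OrderLine> lines, DateTime from, DateTime to)
    {
        return lines
            .Where(l => l.OrderDate >= from && l.OrderDate <= to)
            .Select(l => l.OrderNo)
            .Distinct()
            .Count();
    }
}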

My initial idea is to create two models representing the spreadsheets, import the lot, and query via LINQ. However, this may cause an issue due to the size of the data (and I can't use a database for lazy loading, etc.). Querying the in-memory models may also be quite slow.

The other idea I had was to create an import routine that evaluates the data line by line, i.e. if I need to count how many stock codes there are in a file, I can simply query the one column in the spreadsheet on a line-by-line basis and count them. This would mean I would have to write a CSV import and some sort of query class, along the lines of the sketch below.
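
For example, something like this is what I have in mind for the line-by-line counting (the file name and column index are hypothetical):

using System;
using System.Collections.Generic;
using System.IO;

class StockCodeCount
{
    static void Main()
    {
        var stockCodes = new HashSet<string>();

        // File.ReadLines streams one line at a time (unlike File.ReadAllLines,
        // which would load the whole 1 GB file into memory).
        foreach (var line in File.ReadLines("orderlines.csv")) // hypothetical file name
        {
            // naive split -- a real import would need to handle quoted fields, a header row, etc.
            var columns = line.Split(',');
            stockCodes.Add(columns[3]); // assumed position of the stock code column
        }

        Console.WriteLine("Distinct stock codes: " + stockCodes.Count);
    }
}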

Any thoughts or ideas on the best way to handle this? Is there an existing pattern for this?

Best Answer

I do not know which of the other answers will work best for you.

As far as I can tell, they all make sense from an implementation standpoint; but from a design perspective -- which one will scale best, which will be most maintainable, and so on -- it all depends on other factors you may have left out of your question.

I can think of two:

1) the actual shape of the flat data you already have on hand (independently of its volume);

2) whether or not you have comprehensive knowledge of a bounded, predictable set of the query criteria your application will need.

For instance, suppose one can make two further strong assumptions (which only you can judge) -- and beware, they are indeed very strong:

a) the answer to (2) is "yes";

and

b) the flat CSV is already sorted on order id, then order line, with the order date increasing along with the order id.

Then radarbob's answer might be the best -- all you'd need would be a fast line-by-line CSV reader implementing, e.g., IEnumerable<RawOrderLine> -- with little code to write, easy maintenance, and no other dependencies.

E.g.:

(much hypothetical & untested code)

OrderNo         OrderLine       OrderDate       StockCode       Quantity        Etc

1001            1               1/4/2016        Product1        5               ...
1001            2               1/4/2016        Product2        3               ...
1002            1               5/7/2016        Product1        10              ...
1003            1               2/8/2016        Product2        4               ...

get all orders for given stock code and given date:

API:

IEnumerable<RawOrderLine> GetOrders(IEnumerable<RawOrderLine> csv, string stockCode, DateTime orderDate)

How to:

csv.Where(line => line.StockCode == stockCode && line.OrderDate == orderDate)

get all orders for given stock code and falling in given date range:

API:

IEnumerable<RawOrderLine> GetOrders(IEnumerable<RawOrderLine> csv, string stockCode, DateTime startDate, DateTime endDate)

How to:

csv.SkipWhile(line => line.OrderDate < startDate).
    TakeWhile(line => line.OrderDate <= endDate).
    Where(line => line.StockCode == stockCode)
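
Note that the SkipWhile / TakeWhile pair only pays off because of assumption (b): everything before startDate is read and discarded, and thanks to deferred execution the enumeration stops pulling lines from the underlying reader as soon as it hits a line past endDate, so the rest of the file is never read.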

get sum of quantities for given stock code of all orders falling in given date range:

API:

int GetQuantity(IEnumerable<RawOrderLine> csv, string stockCode, DateTime startDate, DateTime endDate)

How to:

csv.SkipWhile(line => line.OrderDate < startDate).
    TakeWhile(line => line.OrderDate <= endDate).
    Where(line => line.StockCode == stockCode).
    Aggregate
    (
        0,
        (sum, line) => sum += line.Quantity
    );
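
(The Aggregate call is just a longhand sum, by the way -- Sum(line => line.Quantity) would give the same result.)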

Etc, etc.

For a quick and dirty full sample, the code below will generate exactly 3 million random orders into the future, spaced 1 to 2 minutes apart (cf. OrderDate), with 1 to 10 order lines each (of distinct product / quantity pairs) -- resulting in a ~500 MB file (in ASCII/ANSI encoding, anyway).

The two sample queries (over the last 3 days of 2016, for "Product9") take only a couple of seconds on my end (on a rather sluggish machine).

(The main thing that is annoyingly slow is spitting out that random sample data in the first place.)
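
To actually produce data.csv, you would uncomment the #define GENERATE_SAMPLE line at the top, redirect the console output of that run to data.csv (e.g. via > data.csv from a command prompt), then comment the define back out and run again so the #else branch queries the file.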

//#define GENERATE_SAMPLE
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

/*
 * 
OrderNo         OrderLine       OrderDate       StockCode       Quantity        Etc
1001            1               1/4/2016        Product1        5               ...
1001            2               1/4/2016        Product2        3               ...
1002            1               5/7/2016        Product1        10              ...
1003            1               2/8/2016        Product2        4               ...
 * 
 */
namespace _332960
{
    public class RawOrderLine
    {
        public override string ToString()
        {
            return string.Format("{0},{1},{2},{3},{4}", OrderNo, OrderLine, OrderDate.ToString("yyyy-MM-dd HH:mm"), StockCode, Quantity);
        }

        public RawOrderLine Parse(string data)
        {
            var columns = data.Split(',');
            OrderNo = int.Parse(columns[0]);
            OrderLine = int.Parse(columns[1]);
            OrderDate = DateTime.Parse(columns[2]);
            StockCode = columns[3];
            Quantity = int.Parse(columns[4]);
            return this;
        }

        public int OrderNo { get; set; }

        public int OrderLine { get; set; }

        public DateTime OrderDate { get; set; }

        public string StockCode { get; set; }

        public int Quantity { get; set; }
    }

    public class RawOrderReader
    {
        public RawOrderReader(string filePath)
        {
            FilePath = filePath;
        }

        protected string FilePath { get; private set; }

        public IEnumerable<RawOrderLine> Data
        {
            get
            {
                using (var reader = new StreamReader(FilePath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        yield return new RawOrderLine().Parse(line);
                    }
                }
            }
        }
    }

    class Program
    {
        public const int BaseOrderCount = 1000 * 1000;

        public static string[] StockCodes =
            new[]
            {
                "Product0", "Product1", "Product2", "Product3", "Product4",
                "Product5", "Product6", "Product7", "Product8", "Product9"
            };
        static Random random = new Random(DateTime.Now.Millisecond);
        static DateTime lastOrderDate = DateTime.Now;
        static int lastOrderNo = BaseOrderCount - 1;

        static IEnumerable<RawOrderLine> NewRandomOrder()
        {
            var lines = new List<RawOrderLine>();

            // for random selection of 1 to 10 order lines (incl.)
            // one product per line
            var codes = 1 + random.Next(1023);
            var codeMask = 512; // idem
            var product = 9; // idem

            var orderLine = 0;
            lastOrderDate = lastOrderDate.AddMinutes(1 + random.Next(2));
            lastOrderNo++;

            while (codeMask > 0)
            {
                if ((codes & codeMask) > 0)
                {
                    lines.
                    Add
                    (
                        new RawOrderLine
                        {
                            OrderNo = lastOrderNo,
                            OrderLine = ++orderLine,
                            OrderDate = lastOrderDate,
                            StockCode = StockCodes[product],

                            // random quantity from 1 to 20 units (incl.)
                            Quantity = 1 + random.Next(20)
                        }
                    );
                }
                codeMask >>= 1;
                product--;
            }
            return lines;
        }

        public class CsvQuery
        {
            public IEnumerable<RawOrderLine> GetOrders(IEnumerable<RawOrderLine> csv, string stockCode, DateTime orderDate)
            {
                return
                    csv.Where(line => line.StockCode == stockCode && line.OrderDate == orderDate);
            }

            public IEnumerable<RawOrderLine> GetOrders(IEnumerable<RawOrderLine> csv, string stockCode, DateTime startDate, DateTime endDate)
            {
                return
                    csv.SkipWhile(line => line.OrderDate < startDate).
                        TakeWhile(line => line.OrderDate <= endDate).
                        Where(line => line.StockCode == stockCode);
            }

            public int GetQuantity(IEnumerable<RawOrderLine> csv, string stockCode, DateTime orderDate)
            {
                return
                    csv.Where(line => line.StockCode == stockCode && line.OrderDate == orderDate).
                        Aggregate
                        (
                            0,
                            (sum, line) => sum += line.Quantity
                        );
            }

            public int GetQuantity(IEnumerable<RawOrderLine> csv, string stockCode, DateTime startDate, DateTime endDate)
            {
                return
                    csv.SkipWhile(line => line.OrderDate < startDate).
                        TakeWhile(line => line.OrderDate <= endDate).
                        Where(line => line.StockCode == stockCode).
                        Aggregate
                        (
                            0,
                            (sum, line) => sum += line.Quantity
                        );
            }
        }

        static void Main(string[] args)
        {
#if GENERATE_SAMPLE
            // create 3 million orders (of 1 to 10 order lines each)
            for (var i = 0; i < 3 * BaseOrderCount; i++)
            {
                var lines = NewRandomOrder();
                foreach (var line in lines)
                {
                    Console.WriteLine(line);
                }
            }
#else
            var csvReader = new RawOrderReader("data.csv");
            var csvQuery = new CsvQuery();

            Console.WriteLine();
            Console.WriteLine("Get last 3 days of Product9 orders... " + DateTime.Now);
            var lastThreeDaysOf2016Product9Orders =
                csvQuery.
                GetOrders
                (
                    csvReader.Data,
                    "Product9",
                    new DateTime(2016, 12, 29),
                    new DateTime(2016, 12, 31).AddDays(1).Subtract(new TimeSpan(0, 0, 1))
                ).
                ToArray();
            Console.WriteLine("... done @ " + DateTime.Now);

            Console.WriteLine();
            Console.WriteLine("Get last 3 days of Product9 quantities..." + DateTime.Now);
            var lastThreeDaysOf2016Product9Quantity =
                csvQuery.
                GetQuantity
                (
                    csvReader.Data,
                    "Product9",
                    new DateTime(2016, 12, 29),
                    new DateTime(2016, 12, 31).AddDays(1).Subtract(new TimeSpan(0, 0, 1))
                );
            Console.WriteLine("... done @ " + DateTime.Now);

            Console.WriteLine();
            Console.WriteLine("Details...");
            Console.WriteLine();
            foreach (var order in lastThreeDaysOf2016Product9Orders)
            {
                Console.WriteLine(order);
            }
            var quantityCheck = lastThreeDaysOf2016Product9Orders.Sum(order => order.Quantity);
            Console.WriteLine();
            Console.WriteLine("{0} = {1} ?", lastThreeDaysOf2016Product9Quantity, quantityCheck);
            Console.WriteLine();
#endif
            Console.ReadKey();
        }
    }
}

'Hope this helps.
