C# – What’s the fastest way to read a text file line-by-line

cfile-ionetperformancetext-files

I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.

This is what I'm trying so far:

var filestream = new System.IO.FileStream(textFilePath,
                                          System.IO.FileMode.Open,
                                          System.IO.FileAccess.Read,
                                          System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);

while ((lineOfText = file.ReadLine()) != null)
{
    //Do something with the lineOfText
}

Best Answer

To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer but you cannot expect that my results apply to your environment.

Using StreamReader.ReadLine

This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.

const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
  using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
    String line;
    while ((line = streamReader.ReadLine()) != null)
      // Process line
  }

The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.

Using File.ReadLines

This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.

var lines = File.ReadLines(fileName);
foreach (var line in lines)
  // Process line

Using File.ReadAllLines

This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String> allowing you to randomly access the lines.

var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
  var line = lines[i];
  // Process line
}

Using String.Split

This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines increasing the memory required compared to your solution.

using (var streamReader = File.OpenText(fileName)) {
  var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
  foreach (var line in lines)
    // Process line
}

My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code but you should increase the buffer size.

Read all text from a file

Java 11 added the readString() method to read small files as a String, preserving line terminators:

String content = Files.readString(path, StandardCharsets.US_ASCII);

For versions between Java 7 and 11, here's a compact, robust idiom, wrapped up in a utility method:

static String readFile(String path, Charset encoding)
  throws IOException
{
  byte[] encoded = Files.readAllBytes(Paths.get(path));
  return new String(encoded, encoding);
}

Read lines of text from a file

Java 7 added a convenience method to read a file as lines of text, represented as a List<String>. This approach is "lossy" because the line separators are stripped from the end of each line.

List<String> lines = Files.readAllLines(Paths.get(path), encoding);

Java 8 added the Files.lines() method to produce a Stream<String>. Again, this method is lossy because line separators are stripped. If an IOException is encountered while reading the file, it is wrapped in an UncheckedIOException, since Stream doesn't accept lambdas that throw checked exceptions.

try (Stream<String> lines = Files.lines(path, encoding)) {
  lines.forEach(System.out::println);
}

This Stream does need a close() call; this is poorly documented on the API, and I suspect many people don't even notice Stream has a close() method. Be sure to use an ARM-block as shown.

If you are working with a source other than a file, you can use the lines() method in BufferedReader instead.

Memory utilization

The first method, that preserves line breaks, can temporarily require memory several times the size of the file, because for a short time the raw file contents (a byte array), and the decoded characters (each of which is 16 bits even if encoded as 8 bits in the file) reside in memory at once. It is safest to apply to files that you know to be small relative to the available memory.

The second method, reading lines, is usually more memory efficient, because the input byte buffer for decoding doesn't need to contain the entire file. However, it's still not suitable for files that are very large relative to available memory.

For reading large files, you need a different design for your program, one that reads a chunk of text from a stream, processes it, and then moves on to the next, reusing the same fixed-sized memory block. Here, "large" depends on the computer specs. Nowadays, this threshold might be many gigabytes of RAM. The third method, using a Stream<String> is one way to do this, if your input "records" happen to be individual lines. (Using the readLine() method of BufferedReader is the procedural equivalent to this approach.)

Character encoding

One thing that is missing from the sample in the original post is the character encoding. There are some special cases where the platform default is what you want, but they are rare, and you should be able justify your choice.

The StandardCharsets class defines some constants for the encodings required of all Java runtimes:

String content = readFile("test.txt", StandardCharsets.UTF_8);

The platform default is available from the Charset class itself:

String content = readFile("test.txt", Charset.defaultCharset());

Note: This answer largely replaces my Java 6 version. The utility of Java 7 safely simplifies the code, and the old answer, which used a mapped byte buffer, prevented the file that was read from being deleted until the mapped buffer was garbage collected. You can view the old version via the "edited" link on this answer.

Python – How to get line count of a large file cheaply in Python

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))

Best Answer

Related Solutions

Java – How to create a Java string from the contents of a file

Read all text from a file

Read lines of text from a file

Memory utilization

Character encoding

Python – How to get line count of a large file cheaply in Python

Related Topic