Java: Calculate SHA-256 hash of large file efficiently

Tags: hash, java, optimization, performance, sha256

I need to calculate a SHA-256 hash of a large file (or a portion of it). My implementation works fine, but it's much slower than the C++ CryptoPP calculation (25 min. vs. 10 min. for a ~30GB file). What I need is a similar execution time in C++ and Java, so the hashes are ready at almost the same time. I also tried the Bouncy Castle implementation, but it gave me the same result. Here is how I calculate the hash:

int buff = 16384;
try {
    RandomAccessFile file = new RandomAccessFile("T:\\someLargeFile.m2v", "r");

    long startTime = System.nanoTime();
    MessageDigest hashSum = MessageDigest.getInstance("SHA-256");

    byte[] buffer = new byte[buff];
    byte[] partialHash = null;

    long read = 0;

    // calculate the hash of the whole file for the test
    long offset = file.length();
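    int unitsize;
    while (read < offset) {
        unitsize = (int) Math.min(buff, offset - read);
        // read() may return fewer bytes than requested, so hash only what was actually read
        int n = file.read(buffer, 0, unitsize);
        if (n < 0) {
            break; // unexpected end of file
        }

        hashSum.update(buffer, 0, n);

        read += n;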
    }

    file.close();
    // digest() allocates and returns the result array itself
    partialHash = hashSum.digest();

    long endTime = System.nanoTime();

    System.out.println(endTime - startTime);

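} catch (IOException e) {
    // covers FileNotFoundException as well as read errors
    e.printStackTrace();
} catch (NoSuchAlgorithmException e) {
    e.printStackTrace();
}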

Best Answer

My explanation may not solve your problem since it depends a lot on your actual runtime environment, but when I run your code on my system, the throughput is limited by disk I/O and not by the hash calculation. Switching to NIO does not help; the problem is simply that you're reading the file in very small pieces (16kB). Increasing the buffer size (buff) on my system to 1MB instead of 16kB more than doubles the throughput, but at >50MB/s I am still limited by disk speed and not able to fully load a single CPU core.

BTW: You can simplify your implementation a lot by wrapping a DigestInputStream around a FileInputStream, read through the file and get the calculated hash from the DigestInputStream instead of manually shuffling the data from a RandomAccessFile to the MessageDigest as in your code.
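As a minimal sketch of the DigestInputStream approach described above (the class name, the helper names sha256 and toHex, and the 1MB buffer passed in main are illustrative choices, not part of the original code):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StreamingSha256 {

    // Hypothetical helper: hashes the file at 'path', reading it in chunks of 'bufferSize' bytes.
    static byte[] sha256(String path, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new FileInputStream(path);
             DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buffer = new byte[bufferSize];
            // Every byte read through the stream is fed into the digest as a side effect.
            while (dis.read(buffer) != -1) {
                // nothing else to do with the data
            }
        }
        return md.digest();
    }

    // Hypothetical helper: lower-case hex encoding of the digest bytes.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: StreamingSha256 <file>");
            return;
        }
        // 1MB buffer, the size that helped in the measurements above; tune for your system.
        System.out.println(toHex(sha256(args[0], 1 << 20)));
    }
}
```

The buffer size is a parameter so that different values can be compared on the target machine, since the best value depends on the disk and operating system.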

I did a few performance tests with older Java versions, and there seems to be a significant difference between Java 5 and Java 6 here. I'm not sure, though, whether the SHA implementation was optimized or whether the VM simply executes the code much faster. The throughputs I get with the different Java versions (1MB buffer) are:

Sun JDK 1.5.0_15 (client): 28MB/s, limited by CPU
Sun JDK 1.5.0_15 (server): 45MB/s, limited by CPU
Sun JDK 1.6.0_16 (client): 42MB/s, limited by CPU
Sun JDK 1.6.0_16 (server): 52MB/s, limited by disk I/O (85-90% CPU load)

I was a little curious about the impact of the assembler part in the CryptoPP SHA implementation, as the benchmark results indicate that the SHA-256 algorithm requires only 15.8 CPU cycles/byte on an Opteron. Unfortunately, I was not able to build CryptoPP with gcc on cygwin (the build succeeded, but the generated exe failed immediately). Instead, I built a performance benchmark with VS2005 (default release configuration), with and without assembler support in CryptoPP, and compared it to the Java SHA implementation on an in-memory buffer, leaving out any disk I/O. I get the following results on a 2.5GHz Phenom:

Sun JDK1.6.0_13 (server): 26.2 cycles/byte
CryptoPP (C++ only): 21.8 cycles/byte
CryptoPP (assembler): 13.3 cycles/byte

Both benchmarks compute the SHA hash of a 4GB empty byte array, iterating over it in chunks of 1MB, which are passed to MessageDigest#update (Java) or CryptoPP's SHA256.Update function (C++).
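For reference, at 2.5GHz a figure of 26.2 cycles/byte corresponds to roughly 95MB/s (2.5e9 / 26.2). The following is only a sketch of what the Java side of such an in-memory benchmark might look like, not the code used for the numbers above. It assumes a single reusable zero-filled 1MB chunk, a smaller total size than 4GB, and a hard-coded clock frequency; all names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class InMemorySha256Benchmark {

    // Hashes 'totalBytes' zero bytes by passing one reusable chunk to update() repeatedly.
    static byte[] hashZeros(long totalBytes, int chunkSize) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] chunk = new byte[chunkSize]; // Java arrays are zero-filled by default
        long remaining = totalBytes;
        while (remaining > 0) {
            int n = (int) Math.min(chunkSize, remaining);
            md.update(chunk, 0, n);
            remaining -= n;
        }
        return md.digest();
    }

    // cycles/byte = clock frequency (Hz) * elapsed time (s) / bytes processed
    static double cyclesPerByte(double clockHz, long bytes, long elapsedNanos) {
        return clockHz * (elapsedNanos / 1e9) / bytes;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        long total = 256L * 1024 * 1024; // assumption: 256MB keeps the run short
        int chunk = 1024 * 1024;         // 1MB chunks, as in the text
        double clockHz = 2.5e9;          // assumption: set this to your CPU's clock

        hashZeros(total, chunk);         // warm-up pass so the JIT has compiled the hot loop

        long start = System.nanoTime();
        hashZeros(total, chunk);
        long elapsed = System.nanoTime() - start;

        double mbPerSecond = (total / 1e6) / (elapsed / 1e9);
        System.out.printf("%.1f MB/s, %.1f cycles/byte (assuming %.1f GHz)%n",
                mbPerSecond, cyclesPerByte(clockHz, total, elapsed), clockHz / 1e9);
    }
}
```

A single timed pass like this is noisy; treat the output as a rough indication rather than a precise measurement.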

I was able to build and benchmark CryptoPP with gcc 4.4.1 (-O3) in a virtual machine running Linux and got only approximately half the throughput compared to the results from the VS exe. I am not sure how much of the difference is attributable to the virtual machine and how much is caused by VS usually producing better code than gcc, but I have no way to get more exact results from gcc right now.

Related Solutions

Java – How to create a Java string from the contents of a file

Read all text from a file

Java 11 added the readString() method to read small files as a String, preserving line terminators:

String content = Files.readString(path, StandardCharsets.US_ASCII);

For versions between Java 7 and 11, here's a compact, robust idiom, wrapped up in a utility method:

static String readFile(String path, Charset encoding)
  throws IOException
{
  byte[] encoded = Files.readAllBytes(Paths.get(path));
  return new String(encoded, encoding);
}

Read lines of text from a file

Java 7 added a convenience method to read a file as lines of text, represented as a List<String>. This approach is "lossy" because the line separators are stripped from the end of each line.

List<String> lines = Files.readAllLines(Paths.get(path), encoding);

Java 8 added the Files.lines() method to produce a Stream<String>. Again, this method is lossy because line separators are stripped. If an IOException is encountered while reading the file, it is wrapped in an UncheckedIOException, since Stream doesn't accept lambdas that throw checked exceptions.

try (Stream<String> lines = Files.lines(path, encoding)) {
  lines.forEach(System.out::println);
}

This Stream does need a close() call; this is poorly documented in the API, and I suspect many people don't even notice that Stream has a close() method. Be sure to use a try-with-resources (ARM) block as shown.

If you are working with a source other than a file, you can use the lines() method in BufferedReader instead.
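A minimal sketch of that BufferedReader variant, using a StringReader as a stand-in for any non-file source (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class ReaderLines {

    // Collects all lines from an arbitrary Reader; line separators are stripped.
    static List<String> readLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            // lines() is lazy, so collect before the reader is closed
            return reader.lines().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = readLines(new StringReader("first\nsecond\r\nthird"));
        lines.forEach(System.out::println);
    }
}
```

As with Files.lines(), an IOException raised while streaming is wrapped in an UncheckedIOException.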

Memory utilization

The first method, which preserves line breaks, can temporarily require memory several times the size of the file, because for a short time the raw file contents (a byte array) and the decoded characters (each of which is 16 bits even if encoded as 8 bits in the file) reside in memory at once. It is safest to apply it to files that you know to be small relative to the available memory.

The second method, reading lines, is usually more memory efficient, because the input byte buffer for decoding doesn't need to contain the entire file. However, it's still not suitable for files that are very large relative to available memory.

For reading large files, you need a different design for your program, one that reads a chunk of text from a stream, processes it, and then moves on to the next, reusing the same fixed-size memory block. Here, "large" depends on the computer specs. Nowadays, this threshold might be many gigabytes of RAM. The third method, using a Stream<String>, is one way to do this if your input "records" happen to be individual lines. (Using the readLine() method of BufferedReader is the procedural equivalent of this approach.)
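When the records are not lines, the same idea can be sketched with a fixed-size char buffer. This example only counts occurrences of one character, as a placeholder for real processing; the class name, method name, and 64K buffer size are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ChunkedReader {

    // Reads the source in chunks of at most 'bufferSize' chars, reusing one buffer,
    // and counts how often 'target' occurs. Memory use does not grow with input size.
    static long countChar(Reader source, char target, int bufferSize) throws IOException {
        char[] buffer = new char[bufferSize];
        long count = 0;
        int n;
        while ((n = source.read(buffer, 0, buffer.length)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buffer[i] == target) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ChunkedReader <file>");
            return;
        }
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(countChar(reader, ',', 64 * 1024));
        }
    }
}
```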

Character encoding

One thing that is missing from the sample in the original post is the character encoding. There are some special cases where the platform default is what you want, but they are rare, and you should be able to justify your choice.

The StandardCharsets class defines some constants for the encodings required of all Java runtimes:

String content = readFile("test.txt", StandardCharsets.UTF_8);

The platform default is available from the Charset class itself:

String content = readFile("test.txt", Charset.defaultCharset());
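To illustrate why the choice matters, here is a small sketch that decodes the same two bytes with two different charsets (the class and method names are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {

    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent)
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);        // one character
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1); // two characters

        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

The bytes on disk are identical in both cases; only the charset passed to the decoder decides which text comes out.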

Note: This answer largely replaces my Java 6 version. The Java 7 utility methods safely simplify the code, and the old answer, which used a mapped byte buffer, prevented the file that was read from being deleted until the mapped buffer was garbage collected. You can view the old version via the "edited" link on this answer.
