Get a hash from XML

comparison multilingual xml

I need to compute the same hash for an XML document in any programming language.

I tried to get the XML's canonical form and then hash that.

But what I found is that the canonical form is not a "fixed standard": every library and language I worked with implements it differently, so I never get the SAME hash.

So, my question is: is there a way to get a reliable hash of the same canonical XML?

Edit

To get the Canonical Form, I've used:

In C# (.NET 8):
string stringXml = "<doc xmlns=\"http://www.ietf.org\" xmlns:w3c=\"http://www.w3.org\" xml:base=\"something/else\">\n   <e1>\n      <e2 xmlns=\"\" xml:id=\"abc\" xml:base=\"bar/\">\n         <e3 id=\"E3\" xml:base=\"foo\"/>\n      </e2>\n   </e1>\n</doc>";

System.Security.Cryptography.Xml.XmlDsigC14NWithCommentsTransform c14n = new();

System.Xml.XmlDocument documentXml = new();
documentXml.LoadXml(stringXml);
c14n.LoadInput(documentXml);

Stream stream = (Stream)c14n.GetOutput(typeof(Stream));
string result = new StreamReader(stream).ReadToEnd();

using var hash = System.Security.Cryptography.SHA256.Create();
var byteArray = hash.ComputeHash(System.Text.Encoding.UTF8.GetBytes(result));
string sha256hex = Convert.ToHexString(byteArray);

Console.WriteLine(sha256hex);
  • The result was: 4716238DE66819B69981AE1BD3943451D0EADEEA001583D27CDFDC4255484CB6

In Java (version 21):
String stringXml= "<doc xmlns=\"http://www.ietf.org\" xmlns:w3c=\"http://www.w3.org\" xml:base=\"something/else\">\n   <e1>\n      <e2 xmlns=\"\" xml:id=\"abc\" xml:base=\"bar/\">\n         <e3 id=\"E3\" xml:base=\"foo\"/>\n      </e2>\n   </e1>\n</doc>";

org.apache.xml.security.Init.init();
org.apache.xml.security.c14n.Canonicalizer c14n = org.apache.xml.security.c14n.Canonicalizer.getInstance(org.apache.xml.security.c14n.Canonicalizer.ALGO_ID_C14N_WITH_COMMENTS);

java.io.ByteArrayOutputStream stream = new java.io.ByteArrayOutputStream();
c14n.canonicalize(stringXml.getBytes(), stream, false);
String result = stream.toString(java.nio.charset.StandardCharsets.UTF_8);
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(result);
System.out.println(sha256hex);
  • The result was: dea874fbbe21f9e27e521cfddf61aa54bc1b0b18692e3105455eeca24beea1f6

I'm using this XML as an example:

<doc xmlns="http://www.ietf.org" xmlns:w3c="http://www.w3.org" xml:base="something/else">
   <e1>
      <e2 xmlns="" xml:id="abc" xml:base="bar/">
         <e3 id="E3" xml:base="foo"/>
      </e2>
   </e1>
</doc>

Best Answer

Because each library that produces canonical XML implements the standard on its own, you will get minor differences in the output. Any such difference breaks hash comparison, because the bytes being hashed are different.

This will take some experimentation, but here is how I would approach this problem:

  • Run the XML document through your own preferred library, even if the XML document is already in its canonical form. This should eliminate inconsistencies introduced by differing interpretations of the same standard.

  • Attempt to compare hashes by running the document through several different libraries to produce various canonical forms. If one of the resulting hashes matches, call it good.
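The second step could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name CanonicalHashMatcher, the method findMatch, and the two stand-in "canonicalizers" in main are all hypothetical, and real entries would call actual canonicalization libraries. The sketch hashes the canonical bytes directly and compares hex strings case-insensitively, since one side may emit uppercase hex and the other lowercase.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

final class CanonicalHashMatcher {

    /** SHA-256 over the raw bytes, returned as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /**
     * Runs the document through each canonicalizer in order and returns the
     * name of the first one whose output hashes to expectedHex, if any.
     * An empty result is the "mismatch" case that needs separate handling.
     */
    static Optional<String> findMatch(byte[] xml, String expectedHex,
                                      Map<String, UnaryOperator<byte[]>> canonicalizers) {
        for (Map.Entry<String, UnaryOperator<byte[]>> entry : canonicalizers.entrySet()) {
            String actualHex = sha256Hex(entry.getValue().apply(xml));
            // Hex casing differs between platforms, so compare case-insensitively.
            if (actualHex.equalsIgnoreCase(expectedHex)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc>  <e1/>  </doc>".getBytes(StandardCharsets.UTF_8);

        // Stand-ins only (assumption): these are NOT real canonicalizers.
        Map<String, UnaryOperator<byte[]>> canonicalizers = new LinkedHashMap<>();
        canonicalizers.put("keeps-whitespace", bytes -> bytes);
        canonicalizers.put("strips-whitespace", bytes ->
                new String(bytes, StandardCharsets.UTF_8)
                        .replaceAll(">\\s+<", "><")
                        .getBytes(StandardCharsets.UTF_8));

        // Pretend the other side hashed the whitespace-stripped form
        // and sent the hash as uppercase hex.
        String expected = sha256Hex("<doc><e1/></doc>".getBytes(StandardCharsets.UTF_8))
                .toUpperCase(Locale.ROOT);

        System.out.println(findMatch(xml, expected, canonicalizers)
                .map(name -> "matched via " + name)
                .orElse("no match: route to manual review"));
    }
}
```

Keeping the candidates in an ordered map puts your preferred library first, and the name of the matching entry tells you which interpretation the other side appears to use.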

The limited research I've done on this topic indicates that there are still wonky edge cases where the canonical forms produced from two different files by the same library will generate different hashes. I'm not certain what those edge cases are, but in general you need a way to "catch" the files that don't produce the same hash, and then a process for dealing with those mismatches. That process is not something we could help you with, since it likely requires in-depth knowledge of your business processes.

You might consider using XML signatures, which carry more information about how the hash was computed, including which canonicalization method and which digest algorithm were used.
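As a rough illustration of what that buys you, an XML Signature names its algorithms in Algorithm attributes on elements such as CanonicalizationMethod and DigestMethod, so a receiver can read them instead of guessing. The sketch below uses only the JDK's DOM parser. The class name SignatureInfo, its helper methods, and the trimmed sample fragment are hypothetical, and the fragment is not a complete or valid signature.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SignatureInfo {

    /** Namespace of the XML Signature elements. */
    static final String DSIG_NS = "http://www.w3.org/2000/09/xmldsig#";

    /** Parses XML text with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Returns the Algorithm attribute of the first signature element with
     * the given local name, or null if there is no such element.
     */
    static String algorithmOf(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(DSIG_NS, localName);
        if (nodes.getLength() == 0) {
            return null;
        }
        return ((Element) nodes.item(0)).getAttribute("Algorithm");
    }

    public static void main(String[] args) {
        // Trimmed, illustrative fragment: digest and signature values are omitted.
        String xml = """
                <Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
                  <SignedInfo>
                    <CanonicalizationMethod Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315#WithComments"/>
                    <SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"/>
                    <Reference URI="">
                      <DigestMethod Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"/>
                    </Reference>
                  </SignedInfo>
                </Signature>
                """;
        Document doc = parse(xml);
        System.out.println("c14n:   " + algorithmOf(doc, "CanonicalizationMethod"));
        System.out.println("digest: " + algorithmOf(doc, "DigestMethod"));
    }
}
```

With those two identifiers in hand, both sides at least agree on which canonical form and which hash function to apply, which removes one source of mismatch.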
