Four-Way XML Comparison in C#


I have 4 XML files: A, B, C, and D. I want to know if the difference between A and B is the same as the difference between C and D.

The XML files are serializations of the same .NET object; one of the primary differences will be in a particular list that describes the features available on a particular product. (The description of a feature is itself another object.)

All four have very similar structures, but there may be values present in one that aren't present in another, and some values may be changed. For example, if we consider document A:

<xmldoc>
   <a></a>
   <c></c>
   <d></d>
</xmldoc>

Document B:

<xmldoc>
   <a></a>
   <b></b> <!-- Added -->
   <c></c> <!-- C and D are still in the same order (apart from the addition of <b>) -->
   <d></d>
   <e></e> <!-- Also added, but it doesn't affect the order of the others -->
</xmldoc>

Now suppose that I have the following documents. Document C is exactly identical to document A:

<xmldoc>
   <a></a>
   <c></c>
   <d></d>
</xmldoc>

Document D is identical to document B.

Since the difference between C and D is exactly the same as the difference between A and B, this should pass. However, suppose that instead we have document D as follows:

<xmldoc>
   <a></a>
   <b></b> 
   <f></f> <!-- Added -->
   <c></c>
   <d></d>
   <e></e>
   <f></f>
</xmldoc>

The difference between C and D is no longer the same as the difference between A and B.

I'm pretty sure that we won't have a case where document A shows up as:

<xmldoc>
   <c></c>
   <a></a> <!-- Same as the original document A except that this element was reordered; this shouldn't happen -->
   <d></d>
</xmldoc>

My first thought was to use Microsoft's XML Diff Patch library, which compares two files and generates a DiffGram, which is an XML document that describes the difference between the two files being compared. My thought is that I could compare A to B to get DiffGram X and C to D to get DiffGram Y, and then do a third XML comparison between X and Y.

The idea sounds good on paper; unfortunately it's not turning out to be so simple. The difference between A and B is very similar to the difference between C and D, but X and Y look nothing like each other.

The problem is it gives DiffGrams like the following:

<xd:node match="4">
   <xd:node match="2">
      <xd:node match="1">
         <xd:remove match="1-3" />
      </xd:node>
   </xd:node>

   <xd:node match="1">
      <xd:node match="1">
         <xd:remove match="1-3" />
      </xd:node>
   </xd:node>
</xd:node>

This has two problems. First, it's extremely cryptic. I'd prefer it to be more human-readable, but that's not the end of the world, since my primary purpose here is programmatic. Second (and much more critically), the DiffGram seems to be very tightly coupled to the specific XML files in that particular comparison, so two DiffGrams that describe the same logical change don't necessarily look alike.
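A small sketch of why positional references couple a diff to one particular file. It assumes the `match` values are positions within the parent node, as they appear to be, and it counts only elements. `NodeKeys`, `PositionOf`, and `NamePathOf` are hypothetical names made up for this illustration; they are not part of the XML Diff library.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class NodeKeys
{
    // 1-based position of an element among its sibling elements.
    // This kind of key changes whenever an earlier sibling is added or removed.
    public static int PositionOf(XElement element) =>
        element.ElementsBeforeSelf().Count() + 1;

    // Name-based path such as "/xmldoc/e".
    // This kind of key stays the same no matter what else is in the document.
    public static string NamePathOf(XElement element) =>
        "/" + string.Join("/", element.AncestorsAndSelf()
                                      .Reverse()
                                      .Select(e => e.Name.LocalName));
}
```

With document B and the second version of document D from above, `PositionOf` gives 5 and 6 for `<e>`, while `NamePathOf` gives `/xmldoc/e` for both, which is why a diff keyed by position is hard to compare across file pairs.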

I originally posted on the Software Recommendations Stack Exchange asking for recommendations for a .NET library (preferably available as a NuGet package) that would be suitable for this purpose, but didn't have much luck getting a recommendation. (Full disclosure: I haven't deleted that question yet but intend to do so shortly.) If such a library exists, I haven't been able to find it (a lot of them seem not to be designed for the purpose I want to use them for and/or aren't written for the .NET Framework), but if anyone's aware of such a library, that would definitely be an acceptable solution as well (in fact, I would strongly prefer that to having to implement it myself).

Has anyone successfully done something like this (either by creating your own solution, using Microsoft's XML Diff library, or using another third-party library)? If so, what did you do?

I'm hoping that this isn't too broad of a question (if so let me know and I'll edit), but what would be a good approach to this if I end up writing this myself?

Best Answer

"My thought is that I could compare A to B to get DiffGram X and C to D to get DiffGram Y, and then do a third XML comparison between X and Y."

That seems to be a good start. I guess what is missing here is something like a program or XSLT script to transform DiffGram X into a readable representation X'. Then you can apply the same transformation to DiffGram Y, leading to a readable Y'. Comparing X' and Y' gives you a final DiffGram Z, which might in turn be transformed into a readable Z'.

What this script or program will look like probably depends on what kind of assumptions you can make about the structure of the input files. Do they really consist of arbitrarily nested XML trees? Do you need to compare attributes, namespace differences, elements, and element text as well? I would be astonished if one could not use that knowledge to simplify the DiffGrams.
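As a rough illustration of what a readable representation could look like, here is a minimal sketch that skips the DiffGram entirely and builds the description straight from the two documents with LINQ to XML. That is a swapped-in technique, not a transformation of the XML Diff output. It assumes elements can be identified by their name path rather than by position, that reordering does not need to be detected, and that namespaces can be ignored. `ChangeSummary`, `Describe`, and `SameChange` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: describes the change between two documents as
// position-independent "+ path" / "- path" lines.
public static class ChangeSummary
{
    // One line per element (with its text, for leaf elements) and per attribute.
    // Assumption: a name path such as "/xmldoc/b" is enough to identify a node.
    private static IEnumerable<string> Flatten(XElement element, string parentPath)
    {
        string path = parentPath + "/" + element.Name.LocalName;
        string text = element.HasElements ? "" : element.Value.Trim();
        yield return text.Length == 0 ? path : path + "=" + text;

        foreach (XAttribute attribute in element.Attributes().Where(a => !a.IsNamespaceDeclaration))
            yield return path + "/@" + attribute.Name.LocalName + "=" + attribute.Value;

        foreach (XElement child in element.Elements())
            foreach (string line in Flatten(child, path))
                yield return line;
    }

    // Counts occurrences so that repeated elements are not collapsed into one.
    private static Dictionary<string, int> Count(IEnumerable<string> lines) =>
        lines.GroupBy(line => line).ToDictionary(g => g.Key, g => g.Count());

    // Returns sorted "+ ..." lines for additions and "- ..." lines for removals.
    public static List<string> Describe(XDocument before, XDocument after)
    {
        Dictionary<string, int> oldCounts = Count(Flatten(before.Root, ""));
        Dictionary<string, int> newCounts = Count(Flatten(after.Root, ""));
        var result = new List<string>();

        foreach (string key in oldCounts.Keys.Union(newCounts.Keys)
                                        .OrderBy(k => k, StringComparer.Ordinal))
        {
            oldCounts.TryGetValue(key, out int oldN);
            newCounts.TryGetValue(key, out int newN);
            for (int i = 0; i < newN - oldN; i++) result.Add("+ " + key);
            for (int i = 0; i < oldN - newN; i++) result.Add("- " + key);
        }
        return result;
    }

    // True when the change from a to b reads the same as the change from c to d.
    public static bool SameChange(XDocument a, XDocument b, XDocument c, XDocument d) =>
        Describe(a, b).SequenceEqual(Describe(c, d));
}
```

For documents A and B from the question, `Describe` yields `+ /xmldoc/b` and `+ /xmldoc/e`. `SameChange(A, B, C, D)` is true when D is identical to B and false for the version of D with the extra `<f>` elements. A changed value shows up as one removal and one addition. Because the sketch ignores order, it would not notice an element that only moved; if that matters, position information would have to be added back in some normalized form.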
