C# application to read xml files, parsing them, storing locally in data structure and writing to csv


I have to develop a C# application that will read large XML files. The XML files contain data about computer systems (i.e. hardware, software, network, BIOS information). There will be a separate XML file for each category, i.e. a hardware XML file, a software XML file, and so on. A computer id is a common attribute across all of these files, but an id present in one file may be missing from another. After the reading is done and the data is stored locally, it needs to be written to a separate CSV per category, i.e. Hardware.csv, Software.csv. Only those ids that are present in all XML files should be written. I am thinking of the approach below:

1) Read each XML file and store its computer ids in a data structure like a Dictionary (with the key being the computer id and the value an ArrayList of objects). Each object holds the associated attributes.
For example, the hardware XML file is read and stored in the dictionary as <1, hwArrayList>, where each hw object has attributes such as hwModel, hwManufacturer, etc.

2) While reading the second XML file, i.e. the software XML file, if an id already exists in the dictionary then an swArrayList is appended to its value; otherwise a new computer id entry is added to the dictionary.

3) After all reading is done, I'll write out only those ids whose number of ArrayLists equals the number of XML files, with a separate CSV for each ArrayList.
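Roughly, steps 1–3 would look like this in C# (all names here are placeholders, and it assumes each file contributes at most one record per computer id):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MergeSketch
{
    // computerId -> one attribute set per input file read so far
    static readonly Dictionary<string, List<Dictionary<string, string>>> byId =
        new Dictionary<string, List<Dictionary<string, string>>>();

    // Called once per record while reading a file (steps 1 and 2).
    public static void AddRecord(string computerId, Dictionary<string, string> attributes)
    {
        if (!byId.TryGetValue(computerId, out var lists))
        {
            lists = new List<Dictionary<string, string>>();
            byId[computerId] = lists;
        }
        lists.Add(attributes);
    }

    // Step 3: keep only ids that appeared in every file.
    // Note: counting lists only works if each file has one record per id;
    // otherwise you'd track which files an id was seen in instead.
    public static IEnumerable<string> IdsInAllFiles(int fileCount) =>
        byId.Where(kv => kv.Value.Count == fileCount).Select(kv => kv.Key);
}
```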

Is this approach correct?
Will it be feasible with millions of rows in each XML file?

One more thing to add:
the XML files follow the format below:

    <result_sets>
      <result_set>
        <cs>
          <c>
            <wh>123</wh>
            <dn>Computer ID</dn>
            <rt>1</rt>
          </c>
          <c>
            <wh>112</wh>
            <dn>Name</dn>
            <rt>1</rt>
          </c>
          <c>
            <wh>124</wh>
            <dn>System</dn>
            <rt>1</rt>
          </c>
        </cs>
        <rs>
          <r>
            <id>8820</id>
            <cid>230</cid>
            <c>
              <v>230</v>
            </c>
            <c>
              <v>Windows</v>
            </c>
            <c>
              <v>system32</v>
            </c>
          </r>
        </rs>
      </result_set>
    </result_sets>

Is it possible to parse them using XmlReader, or should I treat them as text files and do the processing myself?
Please advise.

Best Answer

Millions of rows - you have to understand the memory requirements of each row stored: if you have 1 million rows and each row takes 1 KB (including object overhead), that's a gigabyte of RAM used. Strings can get big, so you might run into memory constraints quite easily.

What you are trying to do here is a classic merge problem (you could write it in COBOL - the classic COBOL program is one that takes a list of employees and another list of work, and merges them together to run payroll!) and it works best if the input files are sorted.

If they were sorted in id order, you could open each file and read one entry from each at a time; if the ids match, write a row to the output file, then repeat until all input is exhausted. Memory usage is minimal and efficiency is great, as you do not need to store and remember all the ids - they come to you in order.
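The sorted merge can be sketched like this for two inputs (row payloads are plain strings here just for illustration): advance whichever side has the smaller id, and emit a combined row only when the ids match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Classic two-way sorted merge: both sequences must already be ordered by Id.
static IEnumerable<(int Id, string Hw, string Sw)> MergeSorted(
    IEnumerable<(int Id, string Row)> hardware,
    IEnumerable<(int Id, string Row)> software)
{
    using var h = hardware.GetEnumerator();
    using var s = software.GetEnumerator();
    bool hasH = h.MoveNext(), hasS = s.MoveNext();
    while (hasH && hasS)
    {
        if (h.Current.Id == s.Current.Id)
        {
            // Id present in both inputs: emit one merged output row.
            yield return (h.Current.Id, h.Current.Row, s.Current.Row);
            hasH = h.MoveNext();
            hasS = s.MoveNext();
        }
        else if (h.Current.Id < s.Current.Id)
            hasH = h.MoveNext();   // hardware id has no software match: skip it
        else
            hasS = s.MoveNext();   // software id has no hardware match: skip it
    }
}
```

With N input files the same idea generalizes to an N-way merge, emitting an id only when all N readers agree on it.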

So I would work on a sorter for the xml inputs first.
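As for XmlReader: yes, it is the right tool here, because it streams the document forward-only instead of loading it all into memory. A minimal sketch for pulling the `<v>` values out of each `<r>` row in the format shown in the question (element names taken from the sample):

```csharp
using System.Collections.Generic;
using System.Xml;

// Streams one row at a time; the whole file is never held in memory.
static IEnumerable<List<string>> ReadRows(string path)
{
    using var reader = XmlReader.Create(path);
    while (reader.ReadToFollowing("r"))          // next <r> row element
    {
        var values = new List<string>();
        using (var row = reader.ReadSubtree())   // confine reads to this <r>
        {
            while (row.ReadToFollowing("v"))     // skips <id>/<cid>, collects <v>
                values.Add(row.ReadElementContentAsString());
        }
        yield return values;                     // one list of column values per row
    }
}
```

Each yielded list lines up positionally with the column definitions in the `<cs>` block, so the first `<v>` is the Computer ID column in the sample above.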