C# – Delete Duplicate records from large csv file C# .Net


I have created a solution that reads a large CSV file (currently 20-30 MB in size) and tries to delete duplicate rows based on column values that the user chooses at run time, using the usual technique of filtering for duplicate rows. It is so slow that the program appears not to be working at all.

What other technique can be applied to remove duplicate records from a CSV file?

Here's the code; I am definitely doing something wrong:

DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> of the column names to de-duplicate on
DataTable dt = RemoveDuplicateRecords(dtCSV, columns);

private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
        {
            DataView dv = dtCSV.DefaultView;
            string RowFilter = string.Empty;

            // Result table with the same schema but no rows
            DataTable dt = dv.ToTable().Clone();

            foreach (DataRow row in dtCSV.Rows)
            {
                try
                {
                    // Build a filter matching this row's values in the chosen columns
                    RowFilter = string.Empty;
                    foreach (string col in columns)
                    {
                        RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'", "''") + "' and ";
                    }
                    RowFilter = RowFilter.Substring(0, RowFilter.Length - 5); // trim trailing " and "
                    dv.RowFilter = RowFilter;

                    if (!RowExists(dt, RowFilter))
                    {
                        DataRow dr = dt.NewRow();
                        dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                        dt.Rows.Add(dr);
                    }
                }
                catch (Exception ex)
                {
                }
            }
            return dt;
        }

Best Answer

One way to do this would be to go through the table, building a HashSet<string> that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. Something like:

HashSet<string> ScannedRecords = new HashSet<string>();

foreach (DataRow row in dtCSV.Rows)  // DataRowCollection is non-generic, so the type must be explicit
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values 
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}

That should be very fast.
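Putting the pieces together, a complete replacement for `RemoveDuplicateRecords` might look like the sketch below. It keeps the first occurrence of each key and copies it into a fresh table, so one pass over the rows replaces the per-row `DataView` filtering (the `CsvDedup` class name is just for illustration):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Text;

static class CsvDedup
{
    // Returns a new DataTable containing only the first occurrence of each
    // combination of values in the chosen columns. One HashSet lookup per
    // row makes this roughly O(n), versus O(n^2) for filtering a DataView
    // once per row.
    public static DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
    {
        DataTable result = dtCSV.Clone();          // same schema, no rows
        HashSet<string> scannedRecords = new HashSet<string>();

        foreach (DataRow row in dtCSV.Rows)
        {
            // Build a key from the chosen columns. The [{0}={1}] framing
            // keeps adjacent values from running together ("ab","c" vs "a","bc").
            StringBuilder sb = new StringBuilder();
            foreach (string col in columns)
            {
                sb.AppendFormat("[{0}={1}]", col, row[col]);
            }

            // Add returns false when the key was already seen, i.e. a duplicate.
            if (scannedRecords.Add(sb.ToString()))
            {
                result.ImportRow(row);
            }
        }
        return result;
    }
}
```

`ImportRow` copies the row's values into the result table, so the original `dtCSV` is left untouched; if you would rather delete duplicates in place, collect the duplicate rows in a list during the loop and remove them afterwards.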