Question

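I have created a solution that reads large CSV files, currently 20-30 MB in size. I tried to remove duplicate rows based on certain column values that the user chooses at run time, using the usual technique for finding duplicate rows, but it is so slow that the program seems not to work at all.

What other technique can be applied to remove duplicate records from a CSV file?

Here is the code; I am definitely doing something wrong.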

DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> of column names
DataTable dt=RemoveDuplicateRecords(dtCSV, columns);

private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
        {
            DataView dv = dtCSV.DefaultView;
            string RowFilter=string.Empty;

            if(dt==null)
            dt = dv.ToTable().Clone();

            DataRow row = dtCSV.Rows[0];
            foreach (DataRow row in dtCSV.Rows)
            {
                try
                {
                    RowFilter = string.Empty;

                    foreach (string column in columns)
                    {
                        string col = column;
                        RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'","''") + "' and ";
                    }
                    RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
                    dv.RowFilter = RowFilter;
                    DataRow dr = dt.NewRow();
                    bool result = RowExists(dt, RowFilter);
                    if (!result)
                    {
                        dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                        dt.Rows.Add(dr);

                    }

                }
                catch (Exception ex)
                {
                }
            }
            return dt;
        }

Answer 1

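One way to do this is to go through the table, building a HashSet<string> that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. Something like: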

// Requires: using System.Collections.Generic; using System.Data; using System.Text;
HashSet<string> ScannedRecords = new HashSet<string>();

// DataRowCollection is non-generic, so the loop variable must be typed as DataRow.
foreach (DataRow row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}

That should be very fast.
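
To turn the duplicate check above into a method that returns a de-duplicated table, something along these lines might work. This is only a sketch: the class name HashSetDeduplicator and the '\u001F' separator are assumptions made for illustration, and it keeps the first row seen for each key.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Text;

public static class HashSetDeduplicator
{
    // Returns a new table holding the first row seen for each distinct
    // combination of values in the chosen key columns.
    public static DataTable RemoveDuplicateRecords(DataTable source, IList<string> keyColumns)
    {
        DataTable result = source.Clone();   // same schema, no rows
        var seenKeys = new HashSet<string>();
        var key = new StringBuilder();

        foreach (DataRow row in source.Rows)
        {
            key.Clear();
            foreach (string column in keyColumns)
            {
                // Assumption: '\u001F' (unit separator) never occurs in the data.
                // Pick a different delimiter if it can.
                key.Append(row[column]).Append('\u001F');
            }

            // Add returns false when the key is already present.
            if (seenKeys.Add(key.ToString()))
            {
                result.ImportRow(row);
            }
        }
        return result;
    }
}
```

Each row is visited once and each lookup is a hash probe, so the work grows roughly linearly with the number of rows instead of re-filtering the whole table for every row.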

Answer 2

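If you've implemented your sorting routine as a couple of nested for or foreach loops, you could optimize it by sorting the data by the columns you want to de-duplicate against, and then simply comparing each row to the last row you looked at.

Posting some code is a sure-fire way to get better answers, though; without an idea of how you've implemented it, anything you get will just be conjecture.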
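
The sort-then-compare idea could be sketched as follows. The class name SortedDeduplicator is hypothetical, and the sketch assumes the key values can be compared as ordinal strings, which is usually the case for data read straight from a CSV file.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class SortedDeduplicator
{
    public static DataTable RemoveDuplicateRecords(DataTable source, IList<string> keyColumns)
    {
        // Compare two rows on the key columns only, as ordinal strings.
        Comparison<DataRow> byKey = (a, b) =>
        {
            foreach (string column in keyColumns)
            {
                int order = string.CompareOrdinal(a[column].ToString(), b[column].ToString());
                if (order != 0) return order;
            }
            return 0;
        };

        // Sorting puts rows with equal keys next to each other.
        List<DataRow> rows = source.Rows.Cast<DataRow>().ToList();
        rows.Sort(byKey);

        DataTable result = source.Clone();   // same schema, no rows
        DataRow previous = null;
        foreach (DataRow row in rows)
        {
            // Keep a row only when its key differs from the last kept row.
            if (previous == null || byKey(previous, row) != 0)
            {
                result.ImportRow(row);
                previous = row;
            }
        }
        return result;
    }
}
```

Note that the result comes back ordered by key rather than in file order, and because List<T>.Sort is not a stable sort, which of several duplicates survives is not guaranteed.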

Answer 3

Have you tried wrapping the rows in a class and using Linq?

Linq will give you options for getting distinct values, and so on.
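
As a rough illustration of wrapping rows in a class, here is a sketch with a hypothetical Contact type. The property names are invented for the example, and which properties take part in equality is an assumption; a real class would mirror the CSV's actual columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; a real one would mirror the CSV's columns.
public sealed class Contact : IEquatable<Contact>
{
    public string Name { get; set; }
    public string Email { get; set; }
    public string Phone { get; set; }

    // Only Name and Email take part in equality, so Distinct() treats two
    // contacts with the same name and e-mail as duplicates.
    public bool Equals(Contact other)
    {
        return other != null && Name == other.Name && Email == other.Email;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Contact);
    }

    // Must agree with Equals: built from the same two properties.
    public override int GetHashCode()
    {
        unchecked
        {
            return ((Name ?? "").GetHashCode() * 397) ^ (Email ?? "").GetHashCode();
        }
    }
}

public static class ContactDeduplicator
{
    public static List<Contact> Unique(IEnumerable<Contact> contacts)
    {
        // Distinct() uses the default comparer, which picks up IEquatable<Contact>.
        return contacts.Distinct().ToList();
    }
}
```

This only fits when the columns are known at compile time; for columns chosen at run time, a custom comparer is the more natural route.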

Answer 4

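You are currently creating a string-defined filter condition for every row and then running it against the whole table - that is going to be slow.

It is much better to take a Linq2Objects approach, where you read each row in turn into an instance of a class and then use the Linq Distinct operator to select only the unique objects (the non-unique ones are thrown away).

The code would look something like: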

// Distinct must be applied to the whole query, not to each row.
var uniqueRows = (from row in inputCSV.rows
                  select row).Distinct();

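If you don't know which fields your CSV file is going to have, then you may have to modify this slightly - possibly by using an object that reads the CSV cells into a List or Dictionary for each row.

For reading objects from a file using Linq, this article by someone or other might help - http://www.developerfusion.com/article/84468/linq-to-log-files/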
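
When the fields are not known in advance, one possible shape is to keep each row as a string array and pass Distinct a comparer that looks only at the chosen columns. The sketch below is illustrative: SelectedColumnsComparer and CsvDeduplicator are hypothetical names, and the naive Split(',') assumes there are no quoted fields or embedded commas and that every row has all the key columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Compares two parsed CSV rows using only the chosen column positions.
public sealed class SelectedColumnsComparer : IEqualityComparer<string[]>
{
    private readonly int[] _keyIndexes;

    public SelectedColumnsComparer(IEnumerable<int> keyIndexes)
    {
        _keyIndexes = keyIndexes.ToArray();
    }

    public bool Equals(string[] x, string[] y)
    {
        return _keyIndexes.All(i => string.Equals(x[i], y[i], StringComparison.Ordinal));
    }

    // Must agree with Equals: hash only the key cells.
    public int GetHashCode(string[] row)
    {
        unchecked
        {
            int hash = 17;
            foreach (int i in _keyIndexes)
            {
                hash = hash * 31 + (row[i] == null ? 0 : row[i].GetHashCode());
            }
            return hash;
        }
    }
}

public static class CsvDeduplicator
{
    // 'header' holds the column names; 'lines' holds the data lines only.
    public static IEnumerable<string[]> DistinctRows(
        string[] header, IEnumerable<string> lines, IEnumerable<string> keyColumns)
    {
        int[] keyIndexes = keyColumns.Select(c => Array.IndexOf(header, c)).ToArray();
        if (keyIndexes.Contains(-1)) throw new ArgumentException("Unknown key column.");

        // Naive split: assumes no quoted fields or embedded commas.
        return lines.Select(line => line.Split(','))
                    .Distinct(new SelectedColumnsComparer(keyIndexes));
    }
}
```

Because the method takes any IEnumerable<string>, it can be fed from File.ReadLines so that the file is streamed rather than loaded whole.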

Answer 5

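Based on the new code you've included in your question, I'll provide this second answer - I still prefer my first answer, but if you have to use DataTable and DataRows, then this second answer might help: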

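// Requires: using System.Collections.Generic; using System.Data; using System.Linq;
class DataRowEqualityComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // Cell-by-cell comparison: rows are equal when every cell matches.
        return x.ItemArray.SequenceEqual(y.ItemArray);
    }

    public int GetHashCode(DataRow obj)
    {
        // Equal rows must produce equal hash codes, so combine the cell hashes.
        // base.GetHashCode() would hash the comparer itself, not the row.
        int hash = 17;
        foreach (object cell in obj.ItemArray)
        {
            hash = unchecked(hash * 31 + (cell == null ? 0 : cell.GetHashCode()));
        }
        return hash;
    }
}

// ...

var comparer = new DataRowEqualityComparer();
// The explicit DataRow range variable casts the non-generic Rows collection;
// Distinct is applied to the whole query.
var filteredRows = (from DataRow row in dtCSV.Rows
                    select row).Distinct(comparer);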
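
Since the question de-duplicates on columns chosen by the user rather than on whole rows, a variant of the comparer restricted to those columns might look like the sketch below. KeyColumnsRowComparer and DataTableDeduplicator are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Treats two rows as equal when they match on the chosen key columns.
public sealed class KeyColumnsRowComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _keyColumns;

    public KeyColumnsRowComparer(IEnumerable<string> keyColumns)
    {
        _keyColumns = keyColumns.ToArray();
    }

    public bool Equals(DataRow x, DataRow y)
    {
        return _keyColumns.All(c => object.Equals(x[c], y[c]));
    }

    // Must agree with Equals: hash only the key cells.
    public int GetHashCode(DataRow row)
    {
        unchecked
        {
            int hash = 17;
            foreach (string c in _keyColumns)
            {
                // Empty cells hold DBNull.Value rather than null, so this is safe.
                hash = hash * 31 + row[c].GetHashCode();
            }
            return hash;
        }
    }
}

public static class DataTableDeduplicator
{
    public static DataTable RemoveDuplicateRecords(DataTable source, IEnumerable<string> keyColumns)
    {
        DataTable result = source.Clone();   // same schema, no rows
        var comparer = new KeyColumnsRowComparer(keyColumns);
        foreach (DataRow row in source.Rows.Cast<DataRow>().Distinct(comparer))
        {
            result.ImportRow(row);
        }
        return result;
    }
}
```

The result keeps all columns of each surviving row; only the comparison is limited to the key columns.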

Remove duplicate records from a large CSV file in C# .NET
