Question

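I have the following code that I would like to optimize. Because I expect the file to be large, I did not use a HashMap to store the lines and went with a String array instead. I tested the logic with n of about 500,000 and it ran for about 14 minutes. I would definitely like it to be much faster than that, and would appreciate any help or suggestions.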

        public static void RemoveDuplicateEntriesinFile(string filepath)
        {
            if (filepath == null)
                throw new ArgumentException("Please provide a valid FilePath");
            String[] lines = File.ReadAllLines(filepath);
            for (int i = 0; i < lines.Length; i++)
            {
                for (int j = (i + 1); j < lines.Length; j++)
                {
                    if ((lines[i] != null) && (lines[j] != null) && lines[i].Equals(lines[j]))
                    {
                        // replace duplicates with null
                        lines[j] = null;
                    }
                }
            }

            File.WriteAllLines(filepath, lines);
        }

Thanks in advance!

Answer 1

"Because I expect the file to be large, I did not use a HashMap to store the lines and went with a String array instead."

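I disagree with your reasoning; the larger the file, the greater the performance benefit you get from hashing. In your code, you compare each line with every subsequent line, which gives O(n²) computational complexity for the whole file.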

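On the other hand, if you use an efficient hashing algorithm, each hash lookup completes in O(1) on average, so the computational complexity of processing the whole file becomes O(n).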
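
To put rough numbers on the two complexity classes for the n of about 500,000 mentioned in the question, here is a small sketch. The class and method names are made up for illustration, and it only counts operations; it says nothing about actual running time on any particular machine.

```csharp
using System;

public static class ComparisonCount
{
    // Inner-loop iterations of the nested-loop version:
    // one for every pair (i, j) with i < j, i.e. n * (n - 1) / 2.
    public static long NestedLoopIterations(long n)
    {
        return n * (n - 1) / 2;
    }

    // A hash-based pass performs one set lookup/insert per line.
    public static long HashSetOperations(long n)
    {
        return n;
    }
}
```

For n = 500,000 the first method returns 124,999,750,000 and the second returns 500,000.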

Try using a HashSet<string> and observe the difference in processing time:

public static void RemoveDuplicateEntriesinFile(string filepath)
{
    if (filepath == null)
        throw new ArgumentException("Please provide a valid FilePath");

    HashSet<string> hashSet = new HashSet<string>(File.ReadLines(filepath));
    File.WriteAllLines(filepath, hashSet);
}
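
One thing to keep in mind: the HashSet<T> documentation does not promise any particular enumeration order. If the original line order needs to be kept, one possible variant is sketched below. It relies on HashSet<T>.Add returning false for an element that is already present. The names OrderedDedup, DistinctInOrder and RemoveDuplicateEntriesInFileOrdered are hypothetical, and the ordinal comparer is an assumption about what should count as a duplicate.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OrderedDedup
{
    // Yields each distinct line once, in order of first occurrence.
    public static IEnumerable<string> DistinctInOrder(IEnumerable<string> lines)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // Add returns false when the line is already in the set.
            if (seen.Add(line))
                yield return line;
        }
    }

    public static void RemoveDuplicateEntriesInFileOrdered(string filepath)
    {
        if (filepath == null)
            throw new ArgumentNullException(nameof(filepath));

        // Materialize the result first so the file is fully read and closed
        // before it is overwritten.
        List<string> unique = new List<string>(DistinctInOrder(File.ReadLines(filepath)));
        File.WriteAllLines(filepath, unique);
    }
}
```

Like the HashSet version above, this removes duplicate lines entirely, whereas the code in the question replaces them with null.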

Edit: Could you try the following version of the algorithm and check how long it takes? It is optimized to minimize memory consumption:

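// Requires: System, System.Collections.Generic, System.IO, System.Security.Cryptography, System.Text
HashSet<string> hashSet = new HashSet<string>();
string tempFilePath = filepath + ".tmp";

// SHA256.Create() yields the same SHA-256 digests as new SHA256Managed();
// the using blocks dispose the hash object and the streams on exit.
using (HashAlgorithm hashAlgorithm = SHA256.Create())
using (var fs = new FileStream(tempFilePath, FileMode.Create, FileAccess.Write))
using (var sw = new StreamWriter(fs))
{
    // Stream the input one line at a time instead of loading the whole file.
    foreach (string line in File.ReadLines(filepath))
    {
        // Keep only a fixed-size hash of each line in memory, not the line itself.
        byte[] lineBytes = Encoding.UTF8.GetBytes(line);
        byte[] hashBytes = hashAlgorithm.ComputeHash(lineBytes);
        string hash = Convert.ToBase64String(hashBytes);

        // Add returns false if the hash was already seen, so each distinct line is written once.
        if (hashSet.Add(hash))
            sw.WriteLine(line);
    }
}

// Replace the original file with the de-duplicated copy.
File.Delete(filepath);
File.Move(tempFilePath, filepath);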
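
To check how long a given version takes, a Stopwatch-based helper along these lines could be used. This is only a sketch: DedupTiming and Measure are hypothetical names, and a single run gives a rough wall-clock figure rather than a careful benchmark.

```csharp
using System;
using System.Diagnostics;

public static class DedupTiming
{
    // Runs the given action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

It could be called as DedupTiming.Measure(() => RemoveDuplicateEntriesinFile(path)), where path is a copy of the test file, since the method overwrites its input.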

Answer 2

You could try creating a new list and adding to it.

        public static void RemoveDuplicateEntriesinFile(string filepath)
        {
            if (filepath == null)
                throw new ArgumentException("Please provide a valid FilePath");
            String[] lines = File.ReadAllLines(filepath);
            List<String> newLines = new List<String>();
            foreach (string s in lines)
            {
                // Skip lines that were already added. Note: List<T>.Contains is a linear scan.
                if (newLines.Contains(s))
                    continue;
                newLines.Add(s);
            }
            // File.WriteAllLines has an IEnumerable<string> overload, so the list can be passed directly
            File.WriteAllLines(filepath, newLines);
        }

Answer 3

lines[j] = null; did not work for me. File.WriteAllLines(filepath, lines); writes those lines as "" (string.Empty).
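
If the array approach from the question is kept, one way around this would be to drop the null entries before writing. A minimal sketch, assuming LINQ is available; NullFilter and WithoutNulls are hypothetical names:

```csharp
using System.IO;
using System.Linq;

public static class NullFilter
{
    // Returns a new array containing only the non-null entries, in their original order.
    public static string[] WithoutNulls(string[] lines)
    {
        return lines.Where(line => line != null).ToArray();
    }
}
```

The final call in the question's method would then become File.WriteAllLines(filepath, NullFilter.WithoutNulls(lines));.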

Optimizing a method that removes duplicate entries from a file
