Question

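I have a very simple text file parsing application that searches for email addresses and adds each one it finds to a list.

The list currently contains duplicate email addresses, and I am looking for a quick way to trim it down so that it holds only distinct values, without iterating over the items one by one :)

Here is the code: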

var emailLines = new List<string>();
using (var stream = new StreamReader(@"C:\textFileName.txt"))
{
    while (!stream.EndOfStream)
    {
        var currentLine = stream.ReadLine();

        if (!string.IsNullOrEmpty(currentLine) && currentLine.StartsWith("Email: "))
        {
            emailLines.Add(currentLine);
        }
    }
}

Answer 1

If you only need unique items, you can add them to a HashSet instead of a List. Note that a HashSet has no implied order. If you need an ordered set, you can use SortedSet instead.

var emailLines = new HashSet<string>();

Then there will be no duplicates.
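As a rough sketch of how the loop from the question might look with a HashSet: the class name HashSetDemo, the helper ReadDistinctEmailLines, and the sample text are all made up for illustration, and a StringReader stands in for the file so the example runs on its own.

using System;
using System.Collections.Generic;
using System.IO;

class HashSetDemo
{
    // Hypothetical helper: collects the distinct "Email: " lines from any TextReader.
    internal static HashSet<string> ReadDistinctEmailLines(TextReader reader)
    {
        var emailLines = new HashSet<string>();
        string currentLine;
        while ((currentLine = reader.ReadLine()) != null)
        {
            if (currentLine.StartsWith("Email: ", StringComparison.Ordinal))
            {
                // Add returns false and leaves the set unchanged for a duplicate.
                emailLines.Add(currentLine);
            }
        }
        return emailLines;
    }

    static void Main()
    {
        // Made-up sample data standing in for the text file.
        var sample = "Email: a@example.com\nName: A\nEmail: b@example.com\nEmail: a@example.com\n";
        using (var reader = new StringReader(sample))
        {
            var emails = ReadDistinctEmailLines(reader);
            Console.WriteLine(emails.Count); // prints 2
        }
    }
}

Because the set rejects duplicates as they are added, no separate de-duplication pass is needed afterwards.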

To remove duplicates from an existing List, you can use the LINQ extension method Enumerable.Distinct():

IEnumerable<string> distinctEmails = emailLines.Distinct();
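Distinct() does not modify the original list, and it is evaluated lazily, so a ToList() call is needed if a concrete list is wanted. The sketch below also shows the overload that takes an IEqualityComparer<string>. Treating addresses that differ only in letter case as duplicates is an assumption made here for illustration, not something the question asks for, and the class name, helper names, and sample data are hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

class DistinctDemo
{
    // Hypothetical helper: exact string equality (the default comparer).
    internal static List<string> DistinctExact(IEnumerable<string> lines)
    {
        return lines.Distinct().ToList();
    }

    // Hypothetical helper: assumes case differences should not count as distinct.
    internal static List<string> DistinctIgnoringCase(IEnumerable<string> lines)
    {
        return lines.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
    }

    static void Main()
    {
        // Made-up sample data.
        var emailLines = new List<string>
        {
            "Email: a@example.com",
            "Email: A@Example.com",
            "Email: b@example.com",
            "Email: a@example.com"
        };

        Console.WriteLine(DistinctExact(emailLines).Count);        // prints 3
        Console.WriteLine(DistinctIgnoringCase(emailLines).Count); // prints 2
        Console.WriteLine(emailLines.Count);                       // prints 4, the source list is untouched
    }
}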

Answer 2

Try the following:

var emailLines = File.ReadAllLines(@"c:\textFileName.txt")
  .Where(x => !String.IsNullOrEmpty(x) && x.StartsWith("Email: "))
  .Distinct()
  .ToList();

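The downside of this approach is that it reads every line of the file into a string[]. This happens eagerly, so a large file produces a correspondingly large array. Lazy reading of the lines can be achieved with a simple iterator.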

public static IEnumerable<string> ReadAllLinesLazy(string path) { 
  using ( var stream = new StreamReader(path) ) {
    while (!stream.EndOfStream) {
      yield return stream.ReadLine();
    }
  }
}

The File.ReadAllLines call above can then simply be replaced with a call to this function.
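On .NET Framework 4 and later, the built-in File.ReadLines also enumerates a file's lines lazily, so it may be usable in place of a hand-written iterator. The following is a minimal sketch under that assumption; the class name, the DistinctEmailLines helper, and the temporary file with its contents are invented for the example.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ReadLinesDemo
{
    // Hypothetical helper: streams the file line by line and keeps only distinct "Email: " lines.
    internal static List<string> DistinctEmailLines(string path)
    {
        return File.ReadLines(path)
            .Where(x => !string.IsNullOrEmpty(x) && x.StartsWith("Email: ", StringComparison.Ordinal))
            .Distinct()
            .ToList(); // enumerates the whole file here, after which the reader is disposed
    }

    static void Main()
    {
        // A temporary file with made-up contents stands in for the real input.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "Email: a@example.com", "Name: A", "Email: a@example.com" });
            Console.WriteLine(DistinctEmailLines(path).Count); // prints 1
        }
        finally
        {
            File.Delete(path);
        }
    }
}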

Answer 3

IEnumerable / LINQ goodness (works well for large files, since only the matching lines are kept in memory):

// using System.Linq;

var emailLines = ReadFileLines(@"C:\textFileName.txt")
    .Where(line => line.StartsWith("Email: "))
    .Distinct()
    .ToList();

public IEnumerable<string> ReadFileLines(string fileName)
{
    using (var stream = new StreamReader(fileName))
    {
        while (!stream.EndOfStream)
        {
            yield return stream.ReadLine();
        }
    }
}

Get unique items from a list of strings
