Question

I am a beginner with C# and have run into a problem: I have a large list (a text file) of more than 1M lines, structured like this:

698563200209000258
698563200209000316
698563200225000019
698563200232000143
698563200235000199
698563200235000272
698563200240*
698563200293*
698563200301000511
698563200304000849
698563200316000696
698563200328000825
698563200240000833
698563200328000841
698563200328000866
698563200328000882
698563200328000916
698563200328000940
698563200239000957
698563200328000965
698563200239000973
698563200328000981

I am searching for the lines with an asterisk in order to build a list of items to remove (the asterisk lines themselves are excluded). For the example above, the result should be: 698563200293* and 698563200239000957

The code I have is:

698563200239000973

It takes about an hour to finish (4-core i7). Please help me speed it up.

Any suggestions?

Answer 1

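The problem with the original code is that, for every line in the file, you iterate over the entire data set, which results in O(n²) complexity.

The following code runs in O(2 * n + n log n) time, which is O(n log n) overall. Don't forget using System.IO; (plus using System; and using System.Collections.Generic;).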

var textFile = File.ReadAllLines("your file path"); // O(n)

List<string> fileLines = new List<string>(textFile);
List<string> fileListToRemove = new List<string>();

// start with dummy line not in file
string lastLineWithAsterisk = "************";
int asteriskLocation;

// Sort the file O(n log n). Ordinal comparison puts '*' (0x2A) before the
// digits (0x30-0x39), so each * line sorts directly before the numbers it matches.
fileLines.Sort(StringComparer.Ordinal);

// iterate forwards, so that each * line is seen before the numbers it matches.
// O(n)
for (int i = 0; i < fileLines.Count; i++)
{
     asteriskLocation = fileLines[i].IndexOf('*');
     if (asteriskLocation != -1)
         lastLineWithAsterisk = fileLines[i].Substring(0, asteriskLocation);
     else
         if (fileLines[i].StartsWith(lastLineWithAsterisk, StringComparison.Ordinal))
             fileListToRemove.Add(fileLines[i]);
}

You could also use a Parallel.For loop (each thread would need its own fileListToRemove, and the lists would be combined at the end).
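A minimal sketch of that idea, under a few assumptions that are not in the answer: the list has already been sorted with StringComparer.Ordinal (so each asterisk line sits directly before the lines it matches), and the names ParallelScanSketch, FindRemovals and chunkSize are made up for illustration. Each chunk first looks backwards for the asterisk line that is in effect at its start, then scans its own range into a thread-local list; the local lists are merged at the end.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelScanSketch
{
    // sortedLines must be sorted with StringComparer.Ordinal (assumption).
    public static List<string> FindRemovals(List<string> sortedLines, int chunkSize = 100000)
    {
        var result = new List<string>();
        var gate = new object();
        int chunkCount = (sortedLines.Count + chunkSize - 1) / chunkSize;

        Parallel.For(0, chunkCount,
            () => new List<string>(),              // one private list per worker
            (chunk, state, local) =>
            {
                int start = chunk * chunkSize;
                int end = Math.Min(start + chunkSize, sortedLines.Count);

                // Recover the prefix in effect at the start of this chunk by
                // walking back to the nearest preceding asterisk line.
                string prefix = null;
                for (int j = start - 1; j >= 0 && prefix == null; j--)
                {
                    int s = sortedLines[j].IndexOf('*');
                    if (s != -1) prefix = sortedLines[j].Substring(0, s);
                }

                for (int i = start; i < end; i++)
                {
                    string line = sortedLines[i];
                    int star = line.IndexOf('*');
                    if (star != -1)
                        prefix = line.Substring(0, star);
                    else if (prefix != null && line.StartsWith(prefix, StringComparison.Ordinal))
                        local.Add(line);
                }
                return local;
            },
            local => { lock (gate) result.AddRange(local); }); // combine at the end

        return result;
    }
}
```

The order of the returned list depends on how the chunks are scheduled, so sort it afterwards if a stable order is needed.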

Edit: To solve the original question in under a minute, use the following code:

var textFile = File.ReadAllLines("your file path"); // O(n)
List<string> fileLines = new List<string>(textFile);

// start with dummy line not in file
string lastLineWithAsterisk = "************";
int asteriskLocation;

// Sort the file O(n log n). Ordinal comparison puts '*' (0x2A) before the
// digits (0x30-0x39), so each * line sorts directly before the numbers it matches.
fileLines.Sort(StringComparer.Ordinal);

// StreamWriter provides WriteLine; the using block flushes and closes the file.
using (var outFile = new StreamWriter("C:\\outputfile.txt"))
{
     // iterate forwards, so that each * line is seen before the numbers it matches.
     // O(n)
     for (int i = 0; i < fileLines.Count; i++)
     {
          asteriskLocation = fileLines[i].IndexOf('*');
          if (asteriskLocation != -1)
          {
               lastLineWithAsterisk = fileLines[i].Substring(0, asteriskLocation);
               // Write the * lines
               outFile.WriteLine(fileLines[i]);
          }
          else
               // exclude matching lines
               if (!fileLines[i].StartsWith(lastLineWithAsterisk, StringComparison.Ordinal))
                    outFile.WriteLine(fileLines[i]);
     }
}

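This version does not need parallelization, because the limiting factor is the speed of the hard disk.

This assumes that the output order does not matter: the lines are written in the order of the sorted list, not in the order of the input file.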
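If the original order does matter, one possible variation is to use the sorted copy only to decide which lines to drop, and then filter the unsorted input. The sketch below is not from the answer; it assumes an ordinal sort (so each asterisk line precedes the lines it matches), and the names OrderPreservingSketch and Filter are hypothetical. Like the code above, it tracks only the most recent asterisk line, so it assumes asterisk prefixes are not nested inside one another.

```csharp
using System;
using System.Collections.Generic;

public static class OrderPreservingSketch
{
    // Returns the input lines, in their original order, minus every
    // non-asterisk line that starts with the prefix of some asterisk line.
    public static List<string> Filter(IReadOnlyList<string> lines)
    {
        // Work on a sorted copy so the input order is left untouched.
        var sorted = new List<string>(lines);
        sorted.Sort(StringComparer.Ordinal);

        var toRemove = new HashSet<string>(StringComparer.Ordinal);
        string prefix = null; // prefix of the most recent asterisk line
        foreach (string line in sorted)
        {
            int star = line.IndexOf('*');
            if (star != -1)
                prefix = line.Substring(0, star);
            else if (prefix != null && line.StartsWith(prefix, StringComparison.Ordinal))
                toRemove.Add(line);
        }

        // Second pass over the unsorted input keeps the original order.
        var kept = new List<string>(lines.Count);
        foreach (string line in lines)
        {
            if (!toRemove.Contains(line))
                kept.Add(line);
        }
        return kept;
    }
}
```

The extra HashSet costs memory proportional to the number of removed lines, in exchange for one additional linear pass.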

Answer 2

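Step 1: Sort the text file and save it; the search will then run against the sorted file. Creating the new file may take some time. The new file might look like this: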

698563200209000258
698563200209000316
---some numbers---
698563200239*
698563200239000957
698563200239000973
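Step 1 could be sketched as follows. This is only an illustration: the names SortedFileSketch, SortLines, inputPath and sortedPath are hypothetical, and the choice of an ordinal sort is an assumption made so that an asterisk line always lands directly before the numbers it matches, as in the sample above.

```csharp
using System;
using System.IO;

public static class SortedFileSketch
{
    // Reads every line of inputPath, sorts the lines ordinally,
    // and writes the result to sortedPath.
    public static void SortLines(string inputPath, string sortedPath)
    {
        string[] lines = File.ReadAllLines(inputPath);

        // Ordinal comparison places '*' (0x2A) before the digits '0'-'9'
        // (0x30-0x39), so "698563200239*" precedes "698563200239000957".
        Array.Sort(lines, StringComparer.Ordinal);

        File.WriteAllLines(sortedPath, lines);
    }
}
```

This holds the whole file in memory while sorting, which should be acceptable for a file of roughly a million short lines.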

Step 2:

string lineWithAsterisk = null; // prefix of the most recent asterisk line, if any
var fileListToRemove = new List<string>();
// fileHash holds the lines of the sorted file from step 1
foreach (string currentString in fileHash)
{
     bool found = currentString.Contains("*");
     /*
     alternative, you should try which code performs faster:
     bool found = currentString.EndsWith("*");
     */
     if (found)
     {
         // remember the line without its trailing asterisk
         lineWithAsterisk = currentString.Remove(currentString.Length - 1);
         continue;
     }
     if (lineWithAsterisk != null && currentString.StartsWith(lineWithAsterisk))
     {
         fileListToRemove.Add(currentString);
     }
}

I have not tested this code, so it may contain errors. If you find any, please leave a comment.

Answer 3

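Since you will only ever remove non-asterisk lines, you can simplify this by reading through the file and splitting the lines into two groups (the asterisk lines used for matching, and the other lines, which are the candidates for removal).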

var asteriskLines = new HashSet<string>();
var otherLines = new List<string>();
var removeLines = new List<string>();

using (var testFile = new StreamReader("your file path"))
{
    string nextNumber;
    while ((nextNumber = testFile.ReadLine()) != null)
    {
        if (nextNumber.Contains("*"))
        {
            asteriskLines.Add(nextNumber.Substring(0, nextNumber.IndexOf('*')));
        }
        else
        {
            otherLines.Add(nextNumber);
        }
    }
}

foreach (string testNumber in otherLines)
{
    if (asteriskLines.Any(a => testNumber.StartsWith(a)))
    {
        removeLines.Add(testNumber);
    }
}

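If memory is a concern, you could read the file once just to build asteriskLines, then replace the foreach over otherLines with another while loop that reads the file line by line a second time to do the comparison. You could even write removeLines out to another file instead of an in-memory list, in case that list could also grow large.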
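That two-pass variant might look roughly like the sketch below. It is an illustration rather than tested production code: the names TwoPassSketch, WriteRemovals, inputPath and removePath are hypothetical, and File.ReadLines is used here instead of a StreamReader loop because it streams the file lazily in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TwoPassSketch
{
    // Writes every removable line of inputPath to removePath while keeping
    // only the asterisk prefixes in memory.
    public static void WriteRemovals(string inputPath, string removePath)
    {
        // Pass 1: collect the prefixes of the asterisk lines.
        var asteriskLines = new HashSet<string>(StringComparer.Ordinal);
        foreach (string line in File.ReadLines(inputPath))
        {
            int star = line.IndexOf('*');
            if (star != -1)
                asteriskLines.Add(line.Substring(0, star));
        }

        // Pass 2: re-read the file and write matches straight to disk.
        using (var writer = new StreamWriter(removePath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                if (line.IndexOf('*') != -1)
                    continue; // asterisk lines are never removed

                if (asteriskLines.Any(a => line.StartsWith(a, StringComparison.Ordinal)))
                    writer.WriteLine(line);
            }
        }
    }
}
```

Memory use is then bounded by the number of asterisk lines, at the cost of reading the input file twice.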

Addendum:

On a system similar to the OP's (4-core i7, with an SSD), the solution above ran in just under 15 minutes. With the following update:

var removeLines = new ConcurrentBag<string>();

...

Parallel.ForEach(otherLines, (testNumber, state, index) =>
{
    if (asteriskLines.Any(a => testNumber.StartsWith(a, StringComparison.Ordinal)))
    {
        removeLines.Add(testNumber);
    }
});

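It now consistently runs in under a minute.

(Disclaimer: I don't know why it is so much faster, but it is. The results stay consistent, although the order can obviously differ.)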
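A possible further refinement, which is not part of the answer above and has not been benchmarked here: because asteriskLines is a HashSet, each candidate line can be checked with one hash lookup per distinct prefix length instead of a StartsWith call per asterisk line. The names PrefixLookupSketch and FindRemovals are hypothetical, and the sketch assumes the set uses the default (ordinal) string equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixLookupSketch
{
    // Returns the lines of otherLines that start with any prefix in asteriskLines.
    public static List<string> FindRemovals(HashSet<string> asteriskLines,
                                            IEnumerable<string> otherLines)
    {
        // Typically only a handful of distinct prefix lengths exist.
        int[] lengths = asteriskLines.Select(a => a.Length).Distinct().ToArray();

        var removeLines = new List<string>();
        foreach (string testNumber in otherLines)
        {
            foreach (int length in lengths)
            {
                // One hash lookup per prefix length replaces a scan of the whole set.
                if (length <= testNumber.Length &&
                    asteriskLines.Contains(testNumber.Substring(0, length)))
                {
                    removeLines.Add(testNumber);
                    break;
                }
            }
        }
        return removeLines;
    }
}
```

The work per line then depends on the number of distinct prefix lengths rather than on the number of asterisk lines.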

Searching for items to remove in a large list
