Is there a better way to determine the number of lines in a large txt file (1-2 GB)?

Time: 2016-04-01 23:26:52

Tags: c# .net

I'm trying to count all the lines in a txt file, and I'm using:

public int countLines(string path)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    int nlines = 0;
    string line;
    StreamReader file = new StreamReader(path);
    while ((line = file.ReadLine()) != null)
    {
        nlines++;
    }
    watch.Stop();
    var elapsedMs = watch.ElapsedMilliseconds;
    Console.Write(elapsedMs); // elapsedMs = 3520 --- Tested with a 1.2 Mill txt
    return nlines;
}

Is there a more efficient way to count the lines?

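2 answers:

Answer 0: (score: 8)

I'm just thinking out loud here, but the performance is probably I/O bound rather than CPU bound. In any case, I wonder whether interpreting the file as text slows things down, since it has to convert between the file's encoding and string's native encoding. If you know the encoding is ASCII or ASCII-compatible, you might get away with simply counting the bytes whose value is 10 (the character code for a line feed).

What if you had something like the following?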

long lineCount = 0;
byte[] buffer = new byte[1024 * 1024];
int bytesRead;

// Dispose the stream when done; read the file in 1 MB chunks.
using (FileStream fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read, FileShare.None, 1024 * 1024))
{
    do
    {
        bytesRead = fs.Read(buffer, 0, buffer.Length);
        // Count line feed bytes (value 10) in the chunk just read.
        for (int i = 0; i < bytesRead; i++)
            if (buffer[i] == '\n')
                lineCount++;
    }
    while (bytesRead > 0);
}
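
One thing to watch for: the loop above counts line feed bytes, so a last line that does not end in a newline is not counted, whereas StreamReader.ReadLine would return it. The following is a minimal sketch of the same idea wrapped in a method that also counts such a trailing line. The class and method names (LineCounter, CountLines) are made up for illustration, and it assumes an ASCII-compatible encoding (ASCII, UTF-8) with \n or \r\n line endings.

```csharp
using System;
using System.IO;

static class LineCounter
{
    // Hypothetical helper: counts lines by scanning raw bytes for '\n' (value 10).
    // Assumption: the file uses an ASCII-compatible encoding and \n or \r\n endings.
    public static long CountLines(string path)
    {
        long lineCount = 0;
        byte lastByte = (byte)'\n'; // so an empty file yields 0 lines
        byte[] buffer = new byte[1024 * 1024];

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
        {
            int bytesRead;
            while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                {
                    if (buffer[i] == (byte)'\n')
                        lineCount++;
                }
                // Remember the last byte seen so the final line can be checked.
                lastByte = buffer[bytesRead - 1];
            }
        }

        // A final line without a trailing newline still counts as a line.
        if (lastByte != (byte)'\n')
            lineCount++;

        return lineCount;
    }
}
```

It could be called as long n = LineCounter.CountLines(@"C:\MyHugeFile.txt");. Files that use a lone \r as the line terminator, or a UTF-16 encoding, would need a different approach.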

Benchmark results on my 1.5 GB text file, timed 10 times and averaged:

  • StreamReader approach: 4.69 seconds
  • File.ReadLines().Count() approach: 4.54 seconds
  • FileStream approach: 1.46 seconds
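
The answer does not show how the timing was done. As a rough sketch only, a "run it 10 times and take the mean" harness might look like the following; Bench and AverageMilliseconds are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Hypothetical helper: runs `action` `runs` times and returns the mean
    // elapsed time in milliseconds.
    public static double AverageMilliseconds(Action action, int runs = 10)
    {
        if (runs <= 0)
            throw new ArgumentOutOfRangeException(nameof(runs));

        double totalMs = 0;
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
        }
        return totalMs / runs;
    }
}
```

For example, Bench.AverageMilliseconds(() => File.ReadLines(path).Count()) would time the LINQ approach. Keep in mind that repeated runs over the same file are likely to be served from the operating system's file cache, so averages obtained this way mostly reflect CPU cost rather than disk speed.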

Answer 1: (score: 4)

You already have a suitable solution, but you can simplify all of your code to:

var lineCount = File.ReadLines(@"C:\MyHugeFile.txt").Count();
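
Note that Count() returns an int and throws an OverflowException if the sequence has more than int.MaxValue elements. That is unlikely to matter for a 1-2 GB file, but if it might, LongCount() is a drop-in alternative. A minimal sketch, with LinqLineCount as a made-up wrapper name:

```csharp
using System.IO;
using System.Linq;

static class LinqLineCount
{
    // Hypothetical wrapper: File.ReadLines streams the file lazily, and
    // LongCount() returns a long instead of an int.
    public static long CountLines(string path)
    {
        return File.ReadLines(path).LongCount();
    }
}
```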

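Benchmark

I'm not sure how dreamlax arrived at his benchmark results, but here is something anyone can reproduce on their own machine; you can copy and paste it into LINQPad.

First, let's prepare the input file: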

var filePath = @"c:\MyHugeFile.txt";

for (int counter = 0; counter < 5; counter++)
{
    var lines = new string[30000000];

    for (int i = 0; i < lines.Length; i++)
    {
        lines[i] = $"This is a line with a value of: {i}";
    }

    File.AppendAllLines(filePath, lines);
}

This should produce a file of 150 million lines, roughly 6 GB in size.

Now let's run each method:
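
The loop above holds 30 million strings in memory for each batch. If that is too much for your machine, a streaming variant should produce the same lines without the large array. This is only a sketch: TestFileGenerator and its parameters are illustrative names, and unlike the original it overwrites the file instead of appending to it.

```csharp
using System.IO;

static class TestFileGenerator
{
    // Hypothetical helper: writes `batches` x `linesPerBatch` lines, one at a
    // time, so no large string array is ever allocated.
    public static void Generate(string filePath, int batches = 5, int linesPerBatch = 30000000)
    {
        // append: false overwrites any existing file (the original code appends).
        using (var writer = new StreamWriter(filePath, append: false))
        {
            for (int counter = 0; counter < batches; counter++)
            {
                for (int i = 0; i < linesPerBatch; i++)
                {
                    writer.WriteLine($"This is a line with a value of: {i}");
                }
            }
        }
    }
}
```

With the defaults, TestFileGenerator.Generate(@"c:\MyHugeFile.txt") writes the same 150 million lines as the code above.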

void Main()
{
    var filePath = @"c:\MyHugeFile.txt";
    // Make sure you clear windows cache!
    UsingFileStream(filePath);

    // Make sure you clear windows cache!
    UsingStreamReaderLinq(filePath);

    // Make sure you clear windows cache!
    UsingStreamReader(filePath);
}

private void UsingFileStream(string path)
{
    var sw = Stopwatch.StartNew();
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        long lineCount = 0;
        byte[] buffer = new byte[1024 * 1024];
        int bytesRead;

        do
        {
            bytesRead = fs.Read(buffer, 0, buffer.Length);
            for (int i = 0; i < bytesRead; i++)
                if (buffer[i] == '\n')
                    lineCount++;
        }
        while (bytesRead > 0);       
        Console.WriteLine("[FileStream] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
    }
}

private void UsingStreamReaderLinq(string path)
{
    var sw = Stopwatch.StartNew();
    var lineCount = File.ReadLines(path).Count();
    Console.WriteLine("[StreamReader+LINQ] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
}

private void UsingStreamReader(string path)
{
    var sw = Stopwatch.StartNew();
    long lineCount = 0;
    string line;
    using (var file = new StreamReader(path))
    {
        while ((line = file.ReadLine()) != null) { lineCount++; }
        Console.WriteLine("[StreamReader] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
    }
}

The results were:

[FileStream] - Read: 150,000,000 in 00:00:37.3397443

[StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.8842190

[StreamReader] - Read: 150,000,000 in 00:00:34.2102178

Update

Running with optimizations ON results in:

[FileStream] - Read: 150,000,000 in 00:00:18.1636374

[StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.3173354

[StreamReader] - Read: 150,000,000 in 00:00:32.3530890