Question

My code is:

    int linenumber = File.ReadLines(path).Count();

But for a file of about 1 GB this takes a long time (about 20 seconds).

Does anyone know a better way to do this?

Update 6:

I have tested your solutions.

For an 870 MB file:

Method 1 (my code): 13 seconds

Method 2 (from MarcinJuraszek & Locke, same code): 12 seconds

Method 3 (from Richard Deeming): 19 seconds

Method 4 (from user2942249): 13 seconds

Method 5 (from Locke): 13 seconds, the same for lineBuffer = {4096, 8192, 16384, 32768}

Method 6 (from Locke, edition 2): 9 seconds with buffer size = 32 KB, 10 seconds with buffer size = 64 KB

As I said in my comment, there is an application (native code) that opens this file on my PC in 5 seconds, so this is not about hard disk speed.

Compiling the MSIL to native code made no obvious difference.

Conclusion: for now, Locke's method 2 is faster than the other methods.

So I have marked his post as the answer. The question stays open in case anyone finds a better idea.

I gave +1 votes to the dear friends who helped me solve the problem.

Thanks for your help. Better ideas are welcome. Best regards, Smart Man

Answer 1

Here are a few ways this can be done quickly:

StreamReader:

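int lineCount = 0;
using (var sr = new StreamReader(path))
{
    // ReadLine returns null only at end of file; an empty line returns "",
    // so testing for null keeps a blank line from ending the count early.
    while (sr.ReadLine() != null)
        lineCount++;
}

FileStream: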

int lineCount = 0;
var lineBuffer = new byte[65536]; // 64 KB
using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
       FileShare.Read, lineBuffer.Length))
{
    int readBuffer = 0;
    while ((readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0)
    {
        for (int i = 0; i < readBuffer; i++)
        {
            if (lineBuffer[i] == 0xA) // line feed: ends both "\n" and "\r\n" lines
                lineCount++;
        }
    }
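}

Multithreading:

Arguably the number of threads shouldn't affect read speed, but real-world benchmarks sometimes show otherwise. Try different buffer sizes and see whether you get any gain on your setup. *This method contains a race condition. Use with caution.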

var tasks = new Task[Environment.ProcessorCount]; // 1 per core
var fileLock = new ReaderWriterLockSlim();
int bufferSize = 65536; // 64Kb

using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
        FileShare.Read, bufferSize, FileOptions.RandomAccess))
{
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i] = Task.Factory.StartNew(() =>
            {
                int readBuffer = 0;
                var lineBuffer = new byte[bufferSize];

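                // Caution: a read lock admits several readers at once, so the
                // fs.Read calls on the shared stream are not serialized. This is
                // the race condition noted above. The lock is also left held
                // when Read returns 0 and the loop exits.
                while ((fileLock.TryEnterReadLock(10) &&
                       (readBuffer = fs.Read(lineBuffer, 0, lineBuffer.Length)) > 0))
                {
                    fileLock.ExitReadLock();
                    for (int n = 0; n < readBuffer; n++)
                        if (lineBuffer[n] == 0xA) // line feed
                            Interlocked.Increment(ref lineCount);
                }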
            });
    }
    Task.WaitAll(tasks);
}
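
The race in the multithreaded version comes from several tasks calling Read on one shared FileStream. One way to sidestep it, sketched here under stated assumptions, is to give each worker its own stream and its own byte range. The names `ParallelLineCounter` and `CountLineFeeds` are hypothetical and not part of the original answer. The sketch counts LF bytes (0x0A), so it assumes an ASCII or UTF-8 file, expects `workers` to be at least 1, and does not count a final line that lacks a trailing newline.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class ParallelLineCounter
{
    // Hypothetical helper: counts '\n' bytes, one contiguous byte range per worker.
    public static long CountLineFeeds(string path, int workers, int bufferSize = 65536)
    {
        long length = new FileInfo(path).Length;
        long chunk = (length + workers - 1) / workers; // ceiling division
        var partial = new long[workers];               // one slot per worker, no sharing

        Parallel.For(0, workers, w =>
        {
            long start = w * chunk;
            long end = Math.Min(start + chunk, length);
            if (start >= end) return; // nothing left for this worker

            var buffer = new byte[bufferSize];
            long count = 0;

            // Each worker opens its own stream, so no lock is needed.
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                   FileShare.Read, bufferSize, FileOptions.RandomAccess))
            {
                fs.Seek(start, SeekOrigin.Begin);
                long remaining = end - start;
                while (remaining > 0)
                {
                    int toRead = (int)Math.Min(buffer.Length, remaining);
                    int read = fs.Read(buffer, 0, toRead);
                    if (read == 0) break; // unexpected end of file
                    for (int i = 0; i < read; i++)
                        if (buffer[i] == (byte)'\n') count++;
                    remaining -= read;
                }
            }
            partial[w] = count;
        });

        long total = 0;
        foreach (long c in partial) total += c;
        return total;
    }
}
```

A call might look like `ParallelLineCounter.CountLineFeeds(path, Environment.ProcessorCount)`. Whether this beats a single sequential reader depends on the storage device, so it is worth benchmarking before relying on it.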

Answer 2

Assuming that building a string to represent each line is what takes the time, something like this might help:

public static int CountLines1(string path)
{
   int lineCount = 0;
   bool skipNextLineBreak = false;
   bool startedLine = false;
   var buffer = new char[16384];
   int readChars;

   using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, buffer.Length))
   using (var reader = new StreamReader(stream, Encoding.UTF8, false, buffer.Length, false))
   {
      while ((readChars = reader.Read(buffer, 0, buffer.Length)) > 0)
      {
         for (int i = 0; i < readChars; i++)
         {
            switch (buffer[i])
            {
               case '\n':
               {
                  if (skipNextLineBreak)
                  {
                     skipNextLineBreak = false;
                  }
                  else
                  {
                     lineCount++;
                     startedLine = false;
                  }
                  break;
               }
               case '\r':
               {
                  lineCount++;
                  skipNextLineBreak = true;
                  startedLine = false;
                  break;
               }
               default:
               {
                  skipNextLineBreak = false;
                  startedLine = true;
                  break;
               }
            }
         }
      }
   }

   return startedLine ? lineCount + 1 : lineCount;
}

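Edit 2:
It's true what they say about "assume"! The overhead of calling .Read() for every character outweighs the savings from not creating a string for each line. Even with the code updated to read a block of characters at a time, it is still slower than the original approach.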

Answer 3

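It depends on the hardware. One question is what the optimal buffer size is; perhaps something equal to the disk sector size or larger. After experimenting with this myself, I have found that it is usually best to let the system decide. If speed is really a concern, you can drop down to the Win32 API (ReadFile / CreateFile) and specify various flags and parameters, such as asynchronous I/O, unbuffered I/O, sequential reads and so on. This may or may not help performance. You will have to profile and see what works best on your system. In .NET you can pin the buffer for better performance. Pinning memory in a GC environment has other ramifications, of course, but they should be limited if you don't hold on to it for too long.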

    const int bufsize = 4096;
    int lineCount = 0;
    Byte[] buffer = new Byte[bufsize];
    using (System.IO.FileStream fs = new System.IO.FileStream(@"C:\data\log\20111018.txt", FileMode.Open, FileAccess.Read, FileShare.None, bufsize))
    {
        int totalBytesRead = 0;
        int bytesRead;
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0) {
            int i = 0;
            while (i < bytesRead)
            {
                switch (buffer[i])
                {
                    case 10:
                        {
                            lineCount++;
                            i++;
                            break;
                        }
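                    case 13:
                        {
                            int index = i + 1;
                            if (index < bytesRead && buffer[index] == 10)
                            {
                                // CR LF pair: count one line and skip both bytes.
                                lineCount++;
                                i += 2;
                            }
                            else
                            {
                                // Lone CR, or CR at the end of the buffer (a following
                                // LF is counted by case 10 after the next read). Always
                                // advance, otherwise the loop never terminates.
                                i++;
                            }
                            break;
                        }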
                    default:
                        {
                            i++;
                            break;
                        }
                }
            }
            totalBytesRead += bytesRead;
        }
        if ((totalBytesRead > 0) && (lineCount == 0))
            lineCount++;                    
    }
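
The sequential-read hint mentioned above can also be requested from managed code without calling Win32 directly, through the `FileOptions.SequentialScan` value accepted by the FileStream constructor. The following is a minimal sketch, not the answer author's code; `SequentialLineCounter` is a hypothetical name, the 64 KB buffer is an arbitrary choice, and only LF bytes are counted. Whether the hint helps at all depends on the OS and the drive.

```csharp
using System.IO;

public static class SequentialLineCounter
{
    // Hypothetical helper: counts '\n' bytes while telling the OS that the
    // file will be read from start to end.
    public static long CountLineFeeds(string path, int bufferSize = 65536)
    {
        long count = 0;
        var buffer = new byte[bufferSize];

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
               FileShare.Read, bufferSize, FileOptions.SequentialScan))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n') count++;
            }
        }
        return count;
    }
}
```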

Answer 4

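As your tests show, code changes have no significant effect on speed. The bottleneck is your disk reading the data, not the C# code that processes it.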
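
One way to check this claim on a given machine is to time a pass that only reads the file and compare it with a pass that also counts lines. The sketch below is an illustration under assumptions: `ReadBenchmark`, `ReadOnly` and `Time` are hypothetical names, and the OS file cache can make a second run over the same file much faster than the first, so the order of the runs matters.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadBenchmark
{
    // Reads the whole file and discards the data; returns the number of bytes read.
    public static long ReadOnly(string path, int bufferSize = 65536)
    {
        long total = 0;
        var buffer = new byte[bufferSize];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
               FileShare.Read, bufferSize))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                total += read;
        }
        return total;
    }

    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}

// Possible usage (path is whatever file is being measured):
//   TimeSpan raw = ReadBenchmark.Time(() => ReadBenchmark.ReadOnly(path));
//   TimeSpan counted = ReadBenchmark.Time(() => File.ReadLines(path).Count()); // needs System.Linq
```

If the two times are close, the disk dominates and changes to the counting loop will not help much. If the read-only pass is clearly faster, the processing code is worth optimizing.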

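If you want to speed up this task, buy a faster/better hard drive, either one with a higher RPM or a solid-state drive. Alternatively, consider RAID 0, which may improve your disk read speed.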

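Another option is to have multiple hard drives and split the file so that each drive stores one part. You can then parallelize the work with one task handling the part on each drive. (Note that parallelizing the work when you have only one disk won't help anything, and is actually more likely to hurt.)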
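
The one-task-per-drive idea could look roughly like the sketch below. It assumes the file has already been split byte-wise into parts stored on different drives; `MultiDriveLineCounter`, `CountAcrossParts` and the part paths are hypothetical. Because it counts LF bytes, a line that straddles two parts is still counted exactly once.

```csharp
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class MultiDriveLineCounter
{
    // Counts '\n' bytes in a single part file.
    private static long CountLineFeeds(string path)
    {
        long count = 0;
        var buffer = new byte[65536];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n') count++;
        }
        return count;
    }

    // Starts one task per part and sums the per-part counts.
    public static long CountAcrossParts(params string[] partPaths)
    {
        Task<long>[] tasks = partPaths
            .Select(p => Task.Run(() => CountLineFeeds(p)))
            .ToArray();
        Task.WaitAll(tasks);
        return tasks.Sum(t => t.Result);
    }
}
```

The benefit only appears when the parts really live on separate physical drives; with every part on one disk this is the single-disk case the answer warns about.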

What is the fastest way to count the total number of lines in a text file in C#?

4 answers: