I'm trying to count all the lines in a txt file. This is what I'm using:

    public int countLines(string path)
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        int nlines = 0;
        string line;
        // Dispose the reader so the file handle is released when counting is done.
        using (StreamReader file = new StreamReader(path))
        {
            while ((line = file.ReadLine()) != null)
            {
                nlines++;
            }
        }
        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        Console.Write(elapsedMs);
        // elapsedMs = 3520 --- Tested with a 1.2 Mill txt
        return nlines;
    }
Is there a more efficient way to count the lines?
Answer 0 (score: 8)
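I'm just thinking out loud here, but chances are the performance is I/O bound rather than CPU bound. In any case, I wonder whether interpreting the file as text slows things down, since the reader has to convert between the file's encoding and the native encoding of string. If you know the encoding is ASCII or ASCII-compatible, you may get away with simply counting the number of bytes whose value is 10 (the character code for a line feed).
What if you had the following?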
    long lineCount = 0;
    byte[] buffer = new byte[1024 * 1024];
    int bytesRead;
    // Read raw bytes in 1 MB chunks; no text decoding takes place.
    using (FileStream fs = new FileStream("path.txt", FileMode.Open, FileAccess.Read, FileShare.None, 1024 * 1024))
    {
        do
        {
            bytesRead = fs.Read(buffer, 0, buffer.Length);
            for (int i = 0; i < bytesRead; i++)
                if (buffer[i] == '\n') // byte value 10 (line feed)
                    lineCount++;
        }
        while (bytesRead > 0);
    }
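One caveat with counting bytes of value 10: if the last line of the file has no trailing line feed, the byte count comes out one lower than what StreamReader.ReadLine() would report. Here is a minimal sketch of how that could be handled. The LineCounter class and its CountLines methods are names made up for this illustration, and it assumes an ASCII-compatible encoding with LF or CRLF line endings (files using a lone CR as the line terminator would not be counted correctly).

```csharp
using System.IO;

// Hypothetical helper for illustration; not part of the .NET class library.
public static class LineCounter
{
    // Counts lines by scanning raw bytes for LF (value 10).
    // Assumes an ASCII-compatible encoding and LF or CRLF line endings.
    public static long CountLines(Stream stream)
    {
        long lineCount = 0;
        byte[] buffer = new byte[1024 * 1024];
        byte lastByte = (byte)'\n'; // start value so that an empty stream yields 0
        int bytesRead;

        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == (byte)'\n')
                    lineCount++;
            }
            lastByte = buffer[bytesRead - 1];
        }

        // A final line that is not terminated by LF still counts as a line.
        if (lastByte != (byte)'\n')
            lineCount++;

        return lineCount;
    }

    // Convenience overload that opens the file with a 1 MB buffer.
    public static long CountLines(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024))
        {
            return CountLines(fs);
        }
    }
}
```

Taking a Stream rather than only a path makes it easy to check the edge cases against a MemoryStream before pointing it at a multi-gigabyte file.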
My benchmark results for a 1.5 GB text file, timed 10 times and averaged:

StreamReader approach: 4.69 seconds
File.ReadLines().Count() approach: 4.54 seconds
FileStream approach: 1.46 seconds

Answer 1 (score: 4)
You already have an appropriate solution, but you can simplify all of your code to:

    var lineCount = File.ReadLines(@"C:\MyHugeFile.txt").Count();
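I'm not sure how dreamlax arrived at his benchmark results, but here is something anyone can reproduce on their own machine; you can copy and paste it into LINQPad.
First, let's prepare the input file: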
    var filePath = @"c:\MyHugeFile.txt";
    for (int counter = 0; counter < 5; counter++)
    {
        var lines = new string[30000000];
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = $"This is a line with a value of: {i}";
        }
        File.AppendAllLines(filePath, lines);
    }
This should produce a file of 150 million lines, roughly 6 GB in size.
Now let's run each method:
    void Main()
    {
        var filePath = @"c:\MyHugeFile.txt";
        // Make sure you clear windows cache!
        UsingFileStream(filePath);
        // Make sure you clear windows cache!
        UsingStreamReaderLinq(filePath);
        // Make sure you clear windows cache!
        UsingStreamReader(filePath);
    }

    private void UsingFileStream(string path)
    {
        var sw = Stopwatch.StartNew();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long lineCount = 0;
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead;
            do
            {
                bytesRead = fs.Read(buffer, 0, buffer.Length);
                for (int i = 0; i < bytesRead; i++)
                    if (buffer[i] == '\n')
                        lineCount++;
            }
            while (bytesRead > 0);
            Console.WriteLine("[FileStream] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
        }
    }

    private void UsingStreamReaderLinq(string path)
    {
        var sw = Stopwatch.StartNew();
        var lineCount = File.ReadLines(path).Count();
        Console.WriteLine("[StreamReader+LINQ] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
    }

    private void UsingStreamReader(string path)
    {
        var sw = Stopwatch.StartNew();
        long lineCount = 0;
        string line;
        using (var file = new StreamReader(path))
        {
            while ((line = file.ReadLine()) != null) { lineCount++; }
            Console.WriteLine("[StreamReader] - Read: {0:n0} in {1}", lineCount, sw.Elapsed);
        }
    }
The results are:

    [FileStream] - Read: 150,000,000 in 00:00:37.3397443
    [StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.8842190
    [StreamReader] - Read: 150,000,000 in 00:00:34.2102178

Running with optimizations ON gives:

    [FileStream] - Read: 150,000,000 in 00:00:18.1636374
    [StreamReader+LINQ] - Read: 150,000,000 in 00:00:33.3173354
    [StreamReader] - Read: 150,000,000 in 00:00:32.3530890