Question

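I need to process lots of medium to large files (a few hundred MB to several GB) line by line, so I'm interested in the standard D approaches for iterating over lines. The foreach(line; file.byLine()) idiom seems to fit the bill and is pleasantly terse and readable, but its performance seems to be less than ideal.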

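For example, below are two trivial programs in Python and D that iterate over the lines of a file and count them. For a ~470 MB file (~3.6M lines) I get the following timings (best out of 10):

D times: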

real    0m19.146s
user    0m18.932s
sys     0m0.190s

Python times (after EDIT 2, see below):

real    0m0.924s
user    0m0.792s
sys     0m0.129s

Here is the D version, compiled with dmd -O -release -inline -m64:

import std.stdio;
import std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine())
    linect += 1;
  writeln("There are: ", linect, " lines.");
  return 0;
}

And now the corresponding Python version:

import sys

if __name__ == "__main__":
    if (len(sys.argv) < 2):
        sys.exit()
    infile = open(sys.argv[1])
    linect = 0
    for line in infile:
        linect += 1
    print "There are %d lines" % linect

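EDIT 2: I changed the Python code to use the more idiomatic for line in infile, as suggested in the comments below. This sped up the Python version even further, and it is now approaching the speed of a standard wc -l call with the Unix wc tool.

Any advice or pointers on what I might be doing wrong in D that gives such poor performance?

EDIT: For comparison, here is a D version that throws the byLine() idiom out the window, sucks all the data into memory at once, and then splits it into lines post hoc. This gives better performance but is still about 2x slower than the Python version.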

import std.stdio;
import std.string;
import std.file;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto c = cast(string) read(args[1]);
  auto l = splitLines(c);
  writeln("There are ", l.length, " lines.");
  return 0;
}

The timings for this last version are as follows:

real    0m3.201s
user    0m2.820s
sys     0m0.376s

Answer 1

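EDIT AND TL;DR: This problem has been solved in https://github.com/D-Programming-Language/phobos/pull/3089. The improved File.byLine performance will be available starting with D 2.068.

I tried your code on a text file with 575247 lines. The Python baseline takes about 0.125 seconds. Here is my codebase, with the timings embedded in the comments for each method. Explanations follow.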

import std.algorithm, std.file, std.stdio, std.string;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  size_t linect = 0;

  // 0.62 s
  foreach (line; File(args[1]).byLine())
    linect += 1;

  // 0.2 s
  //linect = args[1].readText.count!(c => c == '\n');

  // 0.095 s
  //linect = args[1].readText.representation.count!(c => c == '\n');

  // 0.11 s
  //linect = File(args[1]).byChunk(4096).joiner.count!(c => c == '\n');

  writeln("There are: ", linect, " lines.");
  return 0;
}

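I used dmd -O -release -inline for each variant.

The first version (the slowest) reads one line at a time. We could and should improve the performance of byLine; currently it is hamstrung by things like supporting mixed use of byLine with other C stdio operations, which is probably overly conservative. If we do away with that, we can easily do prefetching and the like.

The second version reads the file in one fell swoop and then uses a standard algorithm to count the lines with a predicate.

The third version acknowledges that there is no need to mind any UTF subtleties; counting bytes works just as well, so it converts the string to its byte-wise representation (at no cost) and then counts the bytes.

The last version (my fave) reads 4 KB of data from the file at a time and flattens the chunks lazily using joiner. Then, again, it counts the bytes.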
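
The gap between the second and third variants comes from how count walks the data: over a string it decodes UTF-8 code points, while over the byte representation it compares raw bytes. A minimal sketch of the byte-wise idea on an in-memory string follows; countNewlines and countLines are hypothetical helper names used for illustration, not Phobos functions.

```d
import std.algorithm : count;
import std.string : representation;

/// Hypothetical helper: counts '\n' bytes without UTF decoding.
size_t countNewlines(string text)
{
    // representation reinterprets the string as immutable(ubyte)[]
    // without copying, so count compares raw bytes.
    return text.representation.count(cast(ubyte) '\n');
}

/// Hypothetical helper: also counts a final line that has no terminator.
size_t countLines(string text)
{
    auto n = countNewlines(text);
    if (text.length && text[$ - 1] != '\n')
        ++n;
    return n;
}

unittest
{
    assert(countNewlines("one\ntwo\nthree") == 2);
    assert(countLines("one\ntwo\nthree") == 3);
    assert(countLines("one\ntwo\n") == 2);
}
```

Byte-wise counting is safe for UTF-8 input because the byte 0x0A never occurs inside a multi-byte sequence. The unittest block can be run with, for example, rdmd -main -unittest.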

Answer 2

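I thought I'd do something new today, so I decided to "learn" D. Please note that this is the first D I've written, so I might be completely off.

The first thing I tried was manual buffering: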

foreach (chunk; infile.byChunk(100000)) {
    linect += splitLines(cast(string) chunk).length;
}

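Note that this is incorrect, since it mishandles lines that cross chunk boundaries, but that gets fixed later.

This helped a little, but not nearly enough. It did allow me to test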

foreach (chunk; infile.byChunk(100000)) {
    linect += (cast(string) chunk).length;
}

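which showed that all the time was being spent in splitLines.

I made a local copy of splitLines. This alone increased the speed by a factor of 2! I wasn't expecting this. I ran with both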
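
On the boundary problem noted above: calling splitLines on each chunk counts a line twice whenever it straddles two chunks. One way around it, sketched here under the assumption that only '\n' terminates a line, is to count newline bytes per chunk and remember whether the data seen so far ended in a newline. countLinesInChunks is a hypothetical helper, not part of Phobos.

```d
import std.algorithm : count;
import std.string : representation;

/// Hypothetical helper: counts '\n'-terminated lines across any
/// range of byte chunks, such as the one File.byChunk yields.
size_t countLinesInChunks(Chunks)(Chunks chunks)
{
    size_t lines = 0;
    bool endsWithNewline = true; // no partial line seen yet
    foreach (chunk; chunks)
    {
        if (chunk.length == 0)
            continue;
        lines += chunk.count(cast(ubyte) '\n');
        endsWithNewline = chunk[$ - 1] == '\n';
    }
    // A final line without a terminator still counts as a line.
    return endsWithNewline ? lines : lines + 1;
}

unittest
{
    // "cdef" straddles the chunk boundary but is counted once.
    auto chunks = ["ab\ncd".representation, "ef\ngh".representation];
    assert(countLinesInChunks(chunks) == 3);
}
```

Because the function is a template over the chunk range, the same code should accept infile.byChunk(100000) directly.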

dmd -release -inline -O -m64 -boundscheck=on
dmd -release -inline -O -m64 -boundscheck=off

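and the result is about the same either way.

I then rewrote splitLines to specialise it for s[i].sizeof == 1. It now seems to be slower than Python only because it also breaks on paragraph separators.

To finish it off, I made a Range and optimized it further, which gets the code close to Python's speed. Considering that Python doesn't break on paragraph separators and that the code underlying it is written in C, this seems fine. This code may have O(n²) performance on lines longer than 8k, but I'm not sure.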

import std.range;
import std.stdio;

auto lines(File file, KeepTerminator keepTerm = KeepTerminator.no) {
    struct Result {
        public File.ByChunk chunks;
        public KeepTerminator keepTerm;
        private string nextLine;
        private ubyte[] cache;

        this(File file, KeepTerminator keepTerm) {
            chunks = file.byChunk(8192);
            this.keepTerm = keepTerm;

            if (chunks.empty) {
                nextLine = null;
            }
            else {
                // Initialize cache and run an
                // iteration to set nextLine
                popFront;
            }
        }

        @property bool empty() {
            return nextLine is null;
        }

        @property auto ref front() {
            return nextLine;
        }

        void popFront() {
            size_t i;
            while (true) {
                // Iterate until we run out of cache
                // or we meet a potential end-of-line
                while (
                    i < cache.length &&
                    cache[i] != '\n' &&
                    cache[i] != 0xA8 &&
                    cache[i] != 0xA9
                ) {
                    ++i;
                }

                if (i == cache.length) {
                    // Can't extend; just give the rest
                    if (chunks.empty) {
                        nextLine = cache.length ? cast(string) cache : null;
                        cache = new ubyte[0];
                        return;
                    }

                    // Extend cache
                    cache ~= chunks.front;
                    chunks.popFront;
                    continue;
                }

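                // Check for false-positives from the end-of-line heuristic
                if (cache[i] != '\n') {
                    if (i < 2 || cache[i - 2] != 0xE2 || cache[i - 1] != 0x80) {
                        // Not a separator: step past this byte, otherwise
                        // the scan above would stop on it again forever
                        ++i;
                        continue;
                    }
                }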

                break;
            }

            size_t iEnd = i + 1;
            if (keepTerm == KeepTerminator.no) {
                // E2 80 A8 or E2 80 A9
                if (cache[i] != '\n') {
                    iEnd -= 3;
                }
                // \r\n
                else if (i > 0 && cache[i - 1] == '\r') {
                    iEnd -= 2;
                }
                // \n
                else {
                    iEnd -= 1;
                }
            }

            nextLine = cast(string) cache[0 .. iEnd];
            cache = cache[i + 1 .. $];
        }
    }

    return Result(file, keepTerm);
}

int main(string[] args)
{
    if (args.length < 2) {
        return 1;
    }

    auto file = File(args[1]);
    writeln("There are: ", walkLength(lines(file)), " lines.");

    return 0;
}
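
The byte values 0xA8 and 0xA9 checked in the range above are the last bytes of the UTF-8 encodings of U+2028 (line separator) and U+2029 (paragraph separator), both of which std.string.splitLines treats as line breaks. A small sketch that checks those encodings and the splitting behaviour:

```d
import std.algorithm : count;
import std.string : representation, splitLines;

unittest
{
    // U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR in UTF-8
    immutable(ubyte)[] lineSepBytes = [0xE2, 0x80, 0xA8];
    immutable(ubyte)[] paraSepBytes = [0xE2, 0x80, 0xA9];
    assert("\u2028".representation == lineSepBytes);
    assert("\u2029".representation == paraSepBytes);

    // splitLines breaks on both separators ...
    assert("a\u2028b\u2029c".splitLines.length == 3);

    // ... whereas a plain '\n' byte count does not see them.
    assert("a\u2028b\u2029c".representation.count(cast(ubyte) '\n') == 0);
}
```

This is why a line count based on splitLines can differ from one that only looks for '\n', as wc -l does.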

Answer 3

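It is debatable whether counting lines is a good proxy for overall performance in a text-processing application. You are testing the efficiency of Python's C library as much as anything else, and you will get different results once you actually start doing useful things with the data. D has had less time than Python to hone its standard library, and fewer people are involved. The performance of byLine has been under discussion for a couple of years now, and I think the next release will be faster.

People do seem to find D efficient and productive for exactly this kind of text processing. For example, AdRoll is well known as a Python shop, but their data scientists use D: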

http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

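To come back to the question: one is obviously comparing compilers and libraries as much as one is comparing languages. DMD's role is that of the reference compiler, and of a compiler that compiles lightning fast. So it is good for rapid development and iteration, but if you need speed you should use LDC or GDC, and if you do use DMD, turn on optimizations and turn off bounds checking.

On my Arch Linux 64-bit HP Probook 4530s machine, using the last 1 million lines of the WestburyLab usenet corpus, I get the following: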

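python2: real 0m0.333s, user 0m0.253s, sys 0m0.013s

pypy (after warm-up): real 0m0.286s, user 0m0.250s, sys 0m0.033s

DMD (defaults): real 0m0.468s, user 0m0.460s, sys 0m0.007s

DMD (-O -release -inline -noboundscheck): real 0m0.398s, user 0m0.393s, sys 0m0.003s

GDC (defaults): real 0m0.400s, user 0m0.380s, sys 0m0.017s [I don't know the switches for GDC optimization]

LDC (defaults): real 0m0.396s, user 0m0.380s, sys 0m0.013s

LDC (-O5): real 0m0.336s, user 0m0.317s, sys 0m0.017s

In a real application, one would use the built-in profiler to identify hotspots and tweak the code, but I agree that naive D should run at a decent speed and, at worst, be in the same ballpark as Python. With LDC and optimization turned on, that is indeed what we see.

For completeness, I changed your D code to the following. (Some of the imports are not needed; I was playing around.)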

import std.stdio;
import std.string;
import std.datetime;
import std.range, std.algorithm;
import std.array;

int main(string[] args)
{
  if (args.length < 2) {
    return 1;
  }
  auto t=Clock.currTime();
  auto infile = File(args[1]);
  uint linect = 0;
  foreach (line; infile.byLine)
    linect += 1;
  auto t2=Clock.currTime-t;
  writefln("There are: %s lines and took %s", linect, t2);
  return 0;
}
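
Clock.currTime reads the wall clock, which can jump if the system time is adjusted. As an alternative for in-process timing, here is a minimal sketch using a monotonic stopwatch. It assumes a Phobos release recent enough to ship std.datetime.stopwatch (newer than the compilers discussed in this thread), and timeOnce is a hypothetical helper name.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Hypothetical helper: runs `work` once and returns the elapsed
/// time as measured by a monotonic clock.
Duration timeOnce(scope void delegate() work)
{
    auto sw = StopWatch(AutoStart.yes);
    work();
    sw.stop();
    return sw.peek();
}

unittest
{
    size_t counter;
    auto elapsed = timeOnce({ foreach (i; 0 .. 1_000) ++counter; });
    assert(counter == 1_000);
    assert(elapsed >= Duration.zero);
}
```

The line-counting loop from the program above could be passed in as the delegate, which keeps process start-up out of the measurement in the same way the Clock.currTime version does.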

Answer 4

Apart from the Python version, this should be faster than your version:

module main;

import std.stdio;
import std.file;
import std.array;

void main(string[] args)
{
    auto infile = File(args[1]);
    auto buffer = uninitializedArray!(char[])(100);
    uint linect;
    while(infile.readln(buffer))
    {
        linect += 1;
    }
    writeln("There are: ", linect, " lines.");
}

Answer 5

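tl;dr Strings are auto-decoded, which makes splitLines slow.

The current implementation of splitLines decodes the string on the fly, which makes it slow. In the next version of Phobos this will be fixed.

There will also be a range that does this for you.

In general, the D GC is not state of the art, but D gives you the opportunity to produce less garbage. To get competitive programs, you need to avoid useless allocations. The second big thing: for fast code use gdc or ldc, because dmd's strength is producing code fast, not producing fast code.

I didn't time it, but this version should not allocate once it has seen the longest line, because it reuses the buffer and does not decode UTF.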

import std.stdio;

void main(string[] args)
{
    auto f = File(args[1]);
    // explicit mention ubyte[], buffer will be reused
    // no UTF decoding, only looks for "\n". See docs.
    int lineCount;
    foreach(ubyte[] line; std.stdio.lines(f))
    {
        lineCount += 1;
    }

    writeln("lineCount: ", lineCount);
}
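
Buffer reuse is what keeps allocations down, but it also means a line is only valid until the next one is read. File.byLine behaves the same way, so any line that must outlive the loop body needs an explicit copy. A minimal sketch, with readLinesCopy as a hypothetical helper name and an arbitrary scratch file name:

```d
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;

/// Hypothetical helper: returns all lines of a file as owned strings.
string[] readLinesCopy(string path)
{
    string[] kept;
    foreach (line; File(path).byLine())
    {
        // line is a char[] view into a buffer that byLine overwrites
        // on the next iteration, so copy it before storing.
        kept ~= line.idup;
    }
    return kept;
}

unittest
{
    // Scratch file in the system temp directory; the name is arbitrary.
    auto path = buildPath(tempDir(), "byline_reuse_demo.txt");
    write(path, "a\nbb\n");
    scope (exit) remove(path);
    assert(readLinesCopy(path) == ["a", "bb"]);
}
```

For pure counting no copy is needed, which is why the loop above can run without allocating per line.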

If you can require that every line ends with a terminator, a version using ranges could look like this:

import std.stdio, std.algorithm;

void main(string[] args)
{
    auto f = File(args[1]);

    auto lineCount = f.byChunk(4096) // read file by chunks of page size
     .joiner // "concatenate" these chunks
     .count(cast(ubyte) '\n'); // count lines
    writeln("lineCount: ", lineCount);
}

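In the next release, the following should be all you need to get near-optimal performance while breaking on all line-breaking whitespace.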

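// lineSplitter lives in std.string, walkLength in std.range
import std.stdio, std.algorithm, std.range, std.string;

void main(string[] args)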
{
    auto f = File(args[1]);

    auto lineCount = f.byChunk(4096) // read file by chunks of page size 
     .joiner // "concatenate" these chunks
     .lineSplitter // split by line
     .walkLength; // count lines
    writeln("lineCount: ", lineCount);
}

Answer 6

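import std.algorithm : splitter;
import std.mmfile;
import std.stdio;

int main(string[] args)
{
    if (args.length < 2)
        return 1;
    // Memory-map the file instead of reading it through stdio
    scope mmf = new MmFile(args[1]);
    size_t linect = 0;
    foreach (line; splitter(cast(string) mmf[], "\n"))
    {
        ++linect;
    }
    writeln("There are: ", linect, " lines.");
    return 0;
}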

Improving line-wise I/O operations in D
