Question

I have a gzip-compressed file containing one million lines:

$ zcat million_lines.txt.gz | head
1
2
3
4
5
6
7
8
9
10
...
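For anyone who wants to reproduce the timings, a test file like the one above could be generated with something along these lines. This is only a sketch: the helper name write_numbered_lines and the batch size of 10000 are illustrative choices, not part of the original question.

```python
import gzip

def write_numbered_lines(path, count):
    """Write the integers 1..count, one per line, to a gzip file at `path`."""
    with gzip.open(path, "wb") as out:
        # Write in batches so we do not make one tiny write per line.
        for start in range(1, count + 1, 10000):
            stop = min(start + 10000, count + 1)
            chunk = "".join("%d\n" % n for n in range(start, stop))
            out.write(chunk.encode("ascii"))
```

Calling write_numbered_lines("million_lines.txt.gz", 1000000) should produce a file with the same shape as the one shown by zcat above.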

My Perl script for processing this file is as follows:

# read_million.pl
use strict; 

my $file = "million_lines.txt.gz" ;

open MILLION, "gzip -cdfq $file |";

while ( <MILLION> ) {
    chomp $_; 
    if ($_ eq "1000000" ) {
        print "This is the millionth line: Perl\n"; 
        last; 
    }
}

And in Python:

# read_million.py
import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

for line in fh:
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break

For whatever reason, the Python script takes almost 8 times longer:

$ time perl read_million.pl ; time python read_million.py
This is the millionth line: Perl

real    0m0.329s
user    0m0.165s
sys     0m0.019s
This is the millionth line: Python

real    0m2.663s
user    0m2.154s
sys     0m0.074s

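I tried profiling both scripts, but there really isn't much code to profile. The Python script spends most of its time on for line in fh; the Perl script spends most of its time on if ($_ eq "1000000").

Now, I know that Perl and Python have some expected differences. For example, in Perl I open the filehandle through a subprocess running the UNIX gzip command; in Python, I use the gzip library.

What can I do to speed up the Python implementation of this script (even if I never reach the Perl performance)? Perhaps the gzip module in Python is slow (or perhaps I'm using it badly); is there a better solution?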

Edit #1

Here are the results of line-by-line profiling of read_million.py.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @profile
     3                                           def main():
     4
     5         1            1      1.0      0.0         filename = 'million_lines.txt.gz'
     6         1          472    472.0      0.0         fh = gzip.open(filename)
     7   1000000      5507042      5.5     84.3         for line in fh:
     8   1000000       582653      0.6      8.9                 line = line.strip()
     9   1000000       443565      0.4      6.8                 if line == '1000000':
    10         1           25     25.0      0.0                         print "This is the millionth line: Python"
    11         1            0      0.0      0.0                         break

Edit #2:

I have now also tried the subprocess Python module, as suggested by @Kirk Strauser and others. It is faster:

Python "subproc" solution:

# read_million_subproc.py 
import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout: 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

Here is a comparison table of everything I have tried so far:

method                    average_running_time (s)
--------------------------------------------------
read_million.py           2.708
read_million_subproc.py   0.850
read_million.pl           0.393

Answer 1

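After testing a number of possibilities, it looks like the big culprits here are:

Comparing apples to oranges: in your original test case, Perl wasn't doing the file I/O or the decompression work; the gzip program was doing it (and it's written in C, so it runs pretty fast). In that version of the code, you're comparing parallel computation to serial computation.
Interpreter start-up time: on the vast majority of systems, Python takes substantially longer to begin running (I believe because more files are loaded at start-up). Interpreter start-up time on my machine is about half the total wall clock time, 30% of the user time, and most of the system time. The actual work done in Python is swamped by start-up time, so your benchmark is as much a comparison of start-up time as of the time required to do the work. Later addition: you can reduce the overhead of Python start-up even further by invoking python with the -E switch (to disable checking of PYTHON* environment variables at start-up) and the -S switch (to disable the automatic import site, which avoids a lot of dynamic sys.path setup and manipulation involving disk I/O, at the cost of cutting off access to any non-builtin libraries).
Python's subprocess module is a bit higher level than Perl's open call, and is implemented in Python (on top of lower-level primitives). The generalized subprocess code takes longer to load (exacerbating the start-up time issue) and adds overhead to the process launch itself.
Python 2's subprocess defaults to unbuffered I/O, so you perform more system calls unless you pass an explicit bufsize argument (4096 to 8192 seems to work fine).
The line.strip() call involves more overhead than you might think; function and method calls are more expensive in Python than they really should be, and line.strip() does not mutate the str in place the way Perl's chomp does (because Python's str is immutable, while Perl strings are mutable).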
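
To get a rough feel for how much of a benchmark is interpreter start-up, one could time an interpreter that does nothing, with and without the -E and -S switches. The sketch below is an assumption-laden illustration: the helper name best_startup_time is made up here, and the measurement also includes the parent's cost of spawning a process, so the numbers are only indicative.

```python
import subprocess
import sys
import time

def best_startup_time(extra_flags=(), repeats=5):
    """Return the best wall-clock time (seconds) to launch the interpreter and run nothing."""
    cmd = [sys.executable] + list(extra_flags) + ["-c", "pass"]
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        subprocess.check_call(cmd)  # Raises if the interpreter exits non-zero.
        best = min(best, time.time() - start)
    return best

if __name__ == "__main__":
    print("default: %.3fs" % best_startup_time())
    print("-E -S:   %.3fs" % best_startup_time(["-E", "-S"]))
```

Taking the best of several runs, rather than the mean, reduces the influence of cold caches and scheduling noise.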

Here are a couple of versions of the code that bypass most of these problems. First, optimized subprocess:

#!/usr/bin/env python

import subprocess

# Launch with subprocess in list mode (no shell involved) and
# use a meaningful buffer size to minimize system calls
proc = subprocess.Popen(['gzip', '-cdfq', 'million_lines.txt.gz'], stdout=subprocess.PIPE, bufsize=4096)
# Iterate stdout directly
for line in proc.stdout:
    if line == b'1000000\n':  # Avoid stripping; the bytes literal also matches on Python 3
        print("This is the millionth line: Python")
        break
# Prevent deadlocks by terminating, not waiting, child process
proc.terminate()

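Second, pure Python, mostly built-in (C level) API based code (which eliminates most extraneous start-up overhead, and shows that Python's gzip module is not meaningfully different from the gzip program), micro-optimized at the expense of readability/maintainability/brevity/portability:

#!/usr/bin/env python

import os

rpipe, wpipe = os.pipe()

def reader():
    import gzip
    FILE = "million_lines.txt.gz"
    os.close(rpipe)
    with gzip.open(FILE) as inf, os.fdopen(wpipe, 'wb') as outf:
        buf = bytearray(16384)  # Reusable buffer to minimize allocator overhead
        while 1:
            cnt = inf.readinto(buf)
            if not cnt: break
            outf.write(buf[:cnt] if cnt != 16384 else buf)

pid = os.fork()
if not pid:
    # Child process: decompress into the pipe, then exit without cleanup
    try:
        reader()
    finally:
        os._exit(0)  # os._exit requires an explicit status code

# Parent process: read decompressed lines from the pipe
try:
    os.close(wpipe)
    with os.fdopen(rpipe, 'rb') as f:
        for line in f:
            if line == b'1000000\n':
                print("This is the millionth line: Python")
                break
finally:
    os.kill(pid, 9)

On my local system, on the best of half a dozen runs, the subprocess code takes:

0.173s/0.157s/0.031s wall/user/sys time.

The primitives-based Python code, with no external utility programs, gets that down to a best time of:

0.147s/0.103s/0.013s

(though that was an outlier; a good wall clock time was usually more like 0.165). Adding -E -S to the invocation shaves off another 0.01-0.015s of wall clock and user time by removing the overhead of setting up the import machinery to handle non-builtins. In other comments, you mention that your Python takes nearly 0.6 seconds to launch while doing absolutely nothing (but otherwise seems to perform similarly to mine), which may indicate that you have quite a bit more going on in the way of non-default packages or environment customization, so -E -S may save you even more.

The Perl code, unmodified from what you gave me (aside from using 3+ arg open to remove string parsing, and storing the pid returned from open so it can be explicitly killed before exiting), had a best time of:

[timing missing from the source]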

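Regardless, we're talking about trivial differences (the timing jitter from run to run was around 0.025s for wall clock and user time, so Python's win on wall clock time was mostly insignificant, though it did save meaningfully on user time). Python can win, as can Perl, but concerns unrelated to the language matter more.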

Answer 2

If I were a betting man, I'd wager that:

line = line.strip()

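is the killer. It performs a method lookup (that is, it resolves line.strip), then calls the method to create another object, then rebinds the name line to the newly created object.

Given that you know exactly what your data will look like, I'd check whether changing your loop to the following makes a difference: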

for line in fh: 
    if line == '1000000\n':
        ...
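
The cost of the two comparison styles can be checked in isolation with timeit. This is a minimal sketch, not a rigorous benchmark: the function names are invented for illustration, and the per-call function overhead is included in both measurements, so only the difference between the two figures is meaningful.

```python
import timeit

LINE = "1000000\n"

def strip_then_compare(line=LINE):
    # Looks up and calls line.strip(), creating a new string before comparing.
    return line.strip() == "1000000"

def compare_with_newline(line=LINE):
    # Compares against a literal that already includes the newline; no new string.
    return line == "1000000\n"

if __name__ == "__main__":
    for fn in (strip_then_compare, compare_with_newline):
        seconds = timeit.timeit(fn, number=1000000)
        print("%-22s %.3fs per million calls" % (fn.__name__, seconds))
```

The second form only works if every line is known to end in exactly "\n", which is the assumption this answer relies on.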

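I think I jumped the gun and answered too quickly. I believe you're right: Perl is "cheating" by running gzip in a separate process. See "Asynchronously read stdout from subprocess.Popen" for a way to do the same thing in Python. It might look like: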

import subprocess

filename = 'million_lines.txt.gz'
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in iter(gzip.stdout.readline, ''): 
    line = line.strip()
    if line == '1000000':
        print "This is the millionth line: Python"
        break
gzip.wait()

After you've done that, please report back. I'd like to see the results of this experiment!

Answer 3

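You made me curious...

The following Python script consistently outperforms the Perl solution on my machine: 3.2s vs. 3.6s for 10,000,000 lines (real time as reported by time, over three runs).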

import subprocess

filename = 'millions.txt.gz'
gzip = subprocess.Popen(
    ['gzip', '-cdfq', filename],
    bufsize = -1, stdout = subprocess.PIPE)

for line in gzip.stdout:
    if line[:-1] == '10000000':
        print "This is the 10 millionth line: Python"
        break

gzip.wait()

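Interestingly, when looking at the time spent in user mode, the Perl solution is slightly better than the Python solution. This seems to indicate that inter-process communication in the Python solution is more efficient than in the Perl solution.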

Answer 4

This version is faster than the Perl version, but it assumes that the line ending is '\n':

import subprocess

filename = "million_lines.txt.gz"
gzip = subprocess.Popen(['gzip', '-cdfq', filename], stdout=subprocess.PIPE)
for line in gzip.stdout:
    if line == '1000000\n':
        print "This is the millionth line: Python"
        break
gzip.terminate()

Test

$ time python Test.py 
This is the millionth line: Python

real    0m0.191s
user    0m0.264s
sys     0m0.016s

$ time perl Test.pl 
This is the millionth line: Perl

real    0m0.404s
user    0m0.488s
sys     0m0.008s

Answer 5

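It looks like the next() method of the gzip file object, which is what for line in uses, is extremely slow, presumably because it reads the uncompressed stream carefully while looking for newlines, perhaps to keep memory usage under control.

Of course, you're comparing apples to oranges, and others have already made better comparisons between Python forking gunzip and Perl forking gunzip. Those are probably efficient because a separate process dumps relatively large uncompressed strings to its standard output.

A non-memory-safe and potentially wasteful approach is: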

import gzip

filename = 'million_lines.txt.gz'

fh = gzip.open(filename)

whole_file = fh.read()
for line in whole_file.splitlines():
    if line == "1000000":
        print "This is the millionth line: Python"
        break

This reads the entire uncompressed file and then splits it.
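
If holding the whole uncompressed file in memory is a concern, a middle ground might be to read large fixed-size chunks and split each one, carrying any partial last line over to the next chunk. The following is only a sketch of that idea, not something measured in this answer: the function name find_line and the 1 MiB default chunk size are assumptions, and it assumes '\n' line endings and a bytes target.

```python
import gzip

def find_line(path, target, chunk_size=1 << 20):
    """Return True if a line of the gzip file at `path` equals `target` (bytes, no newline)."""
    leftover = b""
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                # End of file: leftover is a final line that had no trailing newline.
                return bool(leftover) and leftover == target
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # Possibly incomplete last line; carry it over.
            if target in lines:
                return True
```

Memory use is then bounded by roughly the chunk size plus the longest line, rather than by the size of the whole file.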

Results:

$ time python test201604121.py
This is the millionth line: Python

real    0m0.183s
user    0m0.133s
sys    0m0.046s


$ time perl test201604121.pl

This is the millionth line: Perl

real    0m0.192s
user    0m0.167s
sys    0m0.027s

Python vs Perl: performance of reading a gzip-compressed file
