Question

I'm having a problem processing a large file in Python. All I'm doing is:

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

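This takes about 10m25s just to open the file, read the lines, and increment this counter.

In Perl, processing the same file and doing a bit more (some regular expression work), the whole process takes around 1m17s.

Perl code: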

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

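Can anyone advise what I can do to make the Python solution run at a speed similar to the Perl solution?

EDIT: I tried decompressing the file and processing it with open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so all the loop does now is step through the lines of the file.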

Answer 1

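In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). Moreover, it appears to do something strange: it decompresses up to the end of the file into memory and then discards everything beyond the requested read size (and then does the same again for the next read). DISCLAIMER: I've only looked at gzip.read() for 3 minutes, so I may be wrong here. Whether or not my understanding of gzip.read() is correct, the gzip module does not appear to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (see, for example, the subprocess module).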
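The external-process route suggested above might look like the following minimal sketch. It uses Python 3 syntax, the helper names iter_process_lines and count_lines are made up for illustration, and it assumes a zcat binary is on the PATH when pointed at a gzip file. Lines are read from the pipe as they arrive, so the decompressed data never has to fit in memory at once.

```python
import subprocess


def iter_process_lines(cmd):
    """Yield the stdout of an external command one line at a time (as bytes)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads lines as the child produces them.
        for line in proc.stdout:
            yield line
    finally:
        # Always release the pipe and reap the child process.
        proc.stdout.close()
        proc.wait()


def count_lines(cmd):
    """Count the lines produced by the command."""
    return sum(1 for _ in iter_process_lines(cmd))


# Hypothetical usage, assuming zcat is installed and big.log.gz exists:
#     count_lines(["zcat", "big.log.gz"])
```

Because the function takes the whole command as a list, the same loop works with any decompressor that writes to stdout.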

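EDIT: Actually, I missed the OP's remark that plain file I/O was just as slow as compressed (thanks to ire_and_curses for pointing it out). That struck me as unlikely, so I did some measurements...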

from timeit import Timer
import gzip

def w(n):
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n):
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n, line in enumerate(f):
            if n % 1000000 == 0:
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n, line in enumerate(f):
        if n % 1000000 == 0:
            print n
    f.close()

Now, running it...

>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)

...after a tea break I found it was still running, so I killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

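The gzip module's running time is O(file_size**2), so with line counts in the millions, the gzip read time cannot possibly match the plain read time (which the experiment confirms). Anonymouslemming, please check again.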
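One way to check the scaling claim on a given interpreter is to time reads of files whose sizes double. The sketch below is only an illustration in Python 3 syntax: the function names time_gzip_read and measure_scaling and the chosen sizes are assumptions, and the file is created with the gzip module itself rather than from a terminal. What the timings look like will depend on the Python version in use.

```python
import gzip
import os
import tempfile
import time


def time_gzip_read(n_lines, directory):
    """Write n_lines 81-byte lines to a gzip file, then time reading them back."""
    path = os.path.join(directory, "ttt_%d.gz" % n_lines)
    line = b"*" * 80 + b"\n"
    with gzip.open(path, "wb") as f:
        for _ in range(n_lines):
            f.write(line)
    start = time.perf_counter()
    count = 0
    with gzip.open(path, "rb") as f:
        for _ in f:
            count += 1
    return count, time.perf_counter() - start


def measure_scaling(sizes=(100000, 200000, 400000)):
    """Return {n_lines: seconds}. Quadratic cost would show up as roughly 4x
    the time per doubling of the size, linear cost as roughly 2x."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for n in sizes:
            count, seconds = time_gzip_read(n, tmp)
            assert count == n  # sanity check: every line was read back
            results[n] = seconds
    return results
```

Comparing the ratio between consecutive entries of the returned dictionary gives a rough idea of whether the read time grows linearly or quadratically with file size.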

Answer 2

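If you google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is wicked fast. Your (first) function takes about 4.19s with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).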

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)

Answer 3

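I spent some time on this. Hopefully this code does the trick. It uses zlib with no external calls.

The gunzipchunks method reads the compressed gzip file in chunks, which can be iterated over (a generator).

The gunziplines method reads these uncompressed chunks and gives you one line at a time, which can also be iterated over (another generator).

Finally, the gunziplinescounter method gives you what you're looking for.

Cheers!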

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name):
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, chunk = chunk.split(line_ending,1)
            yield line
        # Carry any partial line over to the next chunk, even when this
        # chunk contained no line ending at all.
        leftovers = chunk
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This is much faster than using the built-in gzip module on very large files.
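For readers on Python 3, where file data arrives as bytes, the same chunked-zlib idea can be sketched as below. This is a simplified variant rather than a port of the code above: the names gunzip_chunks and gunzip_lines are illustrative, it does not cap the size of each decompressed chunk, and it assumes a gzip file with a single member.

```python
import zlib


def gunzip_chunks(file_name, chunk_size=4096):
    """Yield decompressed byte chunks from a single-member gzip file."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflator = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with open(file_name, "rb") as f:
        while True:
            packet = f.read(chunk_size)
            if not packet:
                break
            data = inflator.decompress(packet)
            if data:
                yield data
    tail = inflator.flush()
    if tail:
        yield tail


def gunzip_lines(file_name, line_ending=b"\n"):
    """Yield lines (bytes, without the line ending) from a gzip file."""
    pending = b""
    for chunk in gunzip_chunks(file_name):
        pending += chunk
        # Everything before the last separator is a complete line; the
        # remainder is a partial line kept for the next chunk.
        *lines, pending = pending.split(line_ending)
        for line in lines:
            yield line
    if pending:
        yield pending
```

A line count is then just sum(1 for _ in gunzip_lines(path)).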

Answer 4

It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:

def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()

Then I read it back with a program similar to yours:

def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()

It took my computer about 3 seconds to read all 5 million lines.

Answer 5

Try using StringIO to buffer the output of the gzip module. The following code for reading a gzipped pickle cut the execution time of my code by more than 90%.

Instead of...

import cPickle, gzip

# Use gzip to open/read the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
lData = cPickle.load(lPklFile)
lPklFile.close()

Use...

import cStringIO, cPickle, gzip, os

# Use gzip to open the pickle.
lPklFile = gzip.open("test.pkl", 'rb')

# Copy the pickle into a cStringIO.
lInternalFile = cStringIO.StringIO()
lInternalFile.write(lPklFile.read())
lPklFile.close()

# Set the seek position to the start of the StringIO, and read the
# pickled data from it.
lInternalFile.seek(0, os.SEEK_SET)
lData = cPickle.load(lInternalFile)
lInternalFile.close()
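On Python 3, where cStringIO and cPickle no longer exist, the same buffering idea could be sketched with io.BytesIO and pickle. The function name load_gzipped_pickle is made up for illustration, and no claim is made here that the speedup reported above carries over to newer interpreters.

```python
import gzip
import io
import pickle


def load_gzipped_pickle(path):
    """Decompress the whole file into memory, then unpickle from the buffer."""
    with gzip.open(path, "rb") as gz:
        # Read everything in one call so pickle never issues many small
        # reads against the gzip file object.
        buffer = io.BytesIO(gz.read())
    return pickle.load(buffer)
```

This trades memory for speed: the entire decompressed pickle is held in memory before it is parsed.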

Text file processing speed problem in Python

5 answers: