Question

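I'm writing a script in Python 2.6 (I'm new to Python). What I'm trying to find is the most efficient way to:

Scan about 300,000 .bin files
Each file is between 500 MB and 900 MB
Pull out 2 strings located in each file (both are near the beginning of the file)
Put the output from every file into a single .txt file

I wrote the following script, which works, but it processes each file very slowly. It has processed about 118 files in the past 50 minutes or so: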

 import re, os, codecs

 path = "./" #will search current directory
 dir_lib = os.listdir(path)

 for book in dir_lib:
    if not book.endswith('.bin'): #only looks for files that have .bin extension
            continue
    file = os.path.join(path, book)
    text = codecs.open(file, "r", "utf-8", errors="ignore") 

    #had to use "ignore" because I kept getting error with binary files: 
    #UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 10: 
    #unexpected code byte

    for lineout in text:
            w = re.search("(Keyword1\:)\s(\[(.+?)\])", lineout)
            d = re.search("Keyword2\s(\[(.+?)\])", lineout)

            outputfile = open('output.txt', 'w')

            if w:
                    lineout = w.group(3) #first keyword that is between the [ ]
                    outputfile.write(lineout + ",")
            elif d:
                    lineout = d.group(2) #second keyword that is between the [ ]
                    outputfile.write(lineout + ";")

            outputfile.close()
    text.close()

My output is fine, exactly what I want:

 keyword1,keyword2;keyword1,keyword2;etc,...;

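But at this rate it would take about a month of continuous running. Is there anything else I could try, perhaps an alternative to regular expressions? Is there a way to avoid scanning the whole file and just move on to the next file once the keywords have been found?

Thanks for your suggestions.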

Answer 1

One way is to cheat and mimic grep from Unix. Try http://nedbatchelder.com/code/utilities/pygrep.py

import os

# Get the pygrep script.
if not os.path.exists('pygrep.py'):
    os.system("wget http://nedbatchelder.com/code/utilities/pygrep.py")
from pygrep import grep, Options

# Writes a test file.
text="""This is a text
somehow there are many foo bar in the world.
sometimes they are black sheep, 
sometimes they bar bar black sheep.
most times they foo foo here
and a foo foo there"""
with open('test.txt','w') as fout:
    fout.write(text)

# Here comes the query
queries = ['foo','bar']

opt = Options() # set options for grep.
with open('test.txt','r') as fin:
    for i in queries:
        fin.seek(0) # rewind so each query scans the file from the top
        grep(i, fin, opt)
print
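
For readers who would rather not download a script at run time, the grep-like idea can be approximated in a few lines of plain Python. This is only a minimal sketch and not pygrep's implementation; grep_lines is a hypothetical helper name, and it accepts any iterable of text lines (an open file works).

```python
import re


def grep_lines(pattern, lines):
    """Yield (line_number, line) for every line that contains a match.

    grep_lines is a hypothetical helper, not part of pygrep.
    """
    regex = re.compile(pattern)
    for number, line in enumerate(lines, 1):
        if regex.search(line):
            yield number, line.rstrip("\n")


sample = [
    "This is a text\n",
    "somehow there are many foo bar in the world.\n",
    "sometimes they are black sheep,\n",
    "most times they foo foo here\n",
]
foo_hits = list(grep_lines("foo", sample))
```

Because it is a generator, the caller can stop consuming results as soon as it has what it needs, which matters when the files are hundreds of megabytes long.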

Answer 2

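You can improve your code in at least three ways (in descending order of importance):

You don't break out of the inner for loop once both lines have been found. This means the script goes through the whole file even though both lines were found somewhere near the beginning.
If the regex patterns are the same for all files, compile them outside the outer for loop. If they change from file to file, compile them outside the inner for loop. As it stands, a new regex object is created on every iteration.

Note: this may not actually be the case, since the most recent patterns are cached. (But there is no good reason not to do it.)

Also, you shouldn't open and close the output file on every iteration.

The following code addresses these issues: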

import re, os, codecs

path = "./"
dir_lib = os.listdir(path)
w_pattern = re.compile(r"(Keyword1\:)\s(\[(.+?)\])")
d_pattern = re.compile(r"Keyword2\s(\[(.+?)\])")

with open('output.txt', 'w') as outputfile:
    for book in dir_lib:
        if not book.endswith('.bin'):
            continue
        filename = os.path.join(path, book)
        with codecs.open(filename, "r", "utf-8", errors="ignore") as text:
            w_found, d_found = False, False
            for lineout in text:
                w = w_pattern.search(lineout)
                d = d_pattern.search(lineout)
                if w:
                    lineout = w.group(3)
                    outputfile.write(lineout + ",")
                    w_found = True
                elif d:
                    lineout = d.group(2)
                    outputfile.write(lineout + ";")
                    d_found = True
                if w_found and d_found:
                    break
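
Since the question says both strings sit near the start of each file, another option may be to read only a fixed-size prefix in binary mode and search that, which also avoids decoding the data as UTF-8. The sketch below assumes the keywords always fall within the first PREFIX_BYTES bytes; that limit and the extract_keywords helper are hypothetical, and the patterns are byte versions of the ones in the question.

```python
import re

# Byte patterns, so the binary data never has to be decoded.
W_PATTERN = re.compile(br"Keyword1:\s\[(.+?)\]")
D_PATTERN = re.compile(br"Keyword2\s\[(.+?)\]")

# Assumption: both keywords appear within this many bytes of the file start.
PREFIX_BYTES = 64 * 1024


def extract_keywords(filename, prefix_bytes=PREFIX_BYTES):
    """Return (keyword1, keyword2) as bytes, or None if either is missing."""
    with open(filename, "rb") as f:
        head = f.read(prefix_bytes)  # only the prefix is read from disk
    w = W_PATTERN.search(head)
    d = D_PATTERN.search(head)
    if w and d:
        return w.group(1), d.group(1)
    return None
```

If a file returns None, the caller could fall back to the line-by-line scan above, so a too-small prefix costs time rather than correctness.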

Answer 3

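Some simplifications which may or may not apply:

I assume Keyword1 and Keyword2 both appear at the start of a line (so I can use re.match instead of re.search)
I assume Keyword1 will always appear before Keyword2 (so I can search for one, then the other, which halves the number of calls)

So: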

import codecs
import glob
import re

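# Note: the pattern in the question has no colon after Keyword2;
# drop the \: from END if your files follow that form.
START = re.compile(r"Keyword1\:\s\[(.+?)\]").match
END   = re.compile(r"Keyword2\:\s\[(.+?)\]").match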

def main():
    with open('output.txt', 'w') as outf:
        for fname in glob.glob('*.bin'):
            with codecs.open(fname, 'rb', 'utf-8', errors='ignore') as inf:
                w = None
                for line in inf:
                    w = START(line)
                    if w:
                        break

                d = None
                for line in inf:
                    d = END(line)
                    if d:
                        break

                if w and d:
                    outf.write('{0},{1};'.format(w.group(1), d.group(1)))

if __name__=="__main__":
    main()
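
The first assumption above depends on the difference between re.match, which only matches at the very start of the string, and re.search, which scans the whole string. A small sketch of that difference, using made-up sample lines (in a binary file a keyword may well be preceded by other bytes on the same "line", so the assumption is worth checking on real data):

```python
import re

PATTERN = re.compile(r"Keyword1:\s\[(.+?)\]")

# Made-up sample lines for illustration only.
at_start = "Keyword1: [alpha] and more"
in_middle = "\x00\x01 Keyword1: [alpha]"


def first_group(match):
    """Return the captured keyword, or None when there was no match."""
    return match.group(1) if match else None


start_match = first_group(PATTERN.match(at_start))      # anchored at position 0
middle_match = first_group(PATTERN.match(in_middle))    # keyword is not at position 0
middle_search = first_group(PATTERN.search(in_middle))  # scans the whole line
```

Here match succeeds only for the line that begins with the keyword, while search finds it in both; if the keyword is not guaranteed to start the line, search is the safer choice.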

Searching for multiple strings in large files with Python

3 answers: