Question

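I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing around 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).

This is the first thing I tried to achieve that result: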

mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total

It didn't work (in the sense that it ran indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:

mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

Performance didn't improve; the script still ran indefinitely. So I googled around, and I tried this:

mydict = {}

file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in list:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

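That one has been running for the last 30 minutes, so I'm assuming it's no better.

How should I structure this code so that it completes in a reasonable amount of time?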

Answer 1

I'd favor a slight modification of your second version:

import re

# Start every count at zero so elements that never match still get an entry
mydict = dict.fromkeys(mylist, 0)

# Compile each pattern once, outside the loop
re_list = [re.compile(r"/Main/" + re.escape(i)) for i in mylist]
# Text mode, so that lines are str and match the str patterns on Python 3
with open("dataset.nt", "r") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part
        for i, regex in zip(mylist, re_list):
            if regex.search(line):
                mydict[i] += 1

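As @matsjoyce already suggested, this avoids recompiling the regexes on each iteration. If you really need that many different regex patterns, then I don't think there's much more you can do.

Maybe it's worth checking whether you can regex-capture whatever comes after "/Main/" and then compare that to your list. That may help reduce the number of "real" regex searches.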
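A rough sketch of that capture idea, in case it helps. It assumes the name after "/Main/" ends at the next ">" or whitespace, which I can't confirm without seeing the data, and it only counts exact names (the original patterns also match names that merely start with a list element). The function name count_lines_by_capture is made up for this example.

```python
import re

def count_lines_by_capture(lines, names):
    """Count, for each name, the lines that contain "/Main/<name>"."""
    wanted = set(names)
    # Assumption: the name runs until the next ">" or whitespace character
    token = re.compile(r"/Main/([^>\s]+)")
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # One regex pass per line, then a cheap set intersection.
        # set() makes sure a line is counted once per name.
        for name in set(token.findall(line)) & wanted:
            counts[name] += 1
    return counts
```

You would call it as count_lines_by_capture(infile, mylist) with the file opened in text mode. The cost per line no longer depends on the length of the list.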

Answer 2

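This looks like a good candidate for some map/reduce-like parallelisation... You could split your dataset file into N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.

This of course doesn't prevent you from optimizing the scan first, i.e. (based on Sebastian's code):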
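A minimal sketch of what that could look like with multiprocessing.Pool. The helper names (count_chunk, merge_counts, split_into_chunks, count_parallel) and the default of 4 workers are made up for illustration, and this version reads the whole file into memory and ships the chunks to the workers, which has its own cost, so measure before relying on it.

```python
import re
from collections import Counter
from multiprocessing import Pool

def count_chunk(args):
    """Map step: count matching lines in one chunk of lines."""
    lines, names = args
    targets = [(n, re.compile(r"/Main/" + re.escape(n))) for n in names]
    counts = Counter()
    for line in lines:
        if "/Main/" not in line:
            continue
        for name, regex in targets:
            if regex.search(line):
                counts[name] += 1
    return counts

def merge_counts(partials, names):
    """Reduce step: sum the per-chunk counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return {name: total[name] for name in names}

def split_into_chunks(lines, n):
    """Split a list of lines into at most n roughly equal slices."""
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def count_parallel(path, names, workers=4):
    """Scan the file with `workers` processes and sum the results."""
    with open(path) as infile:
        lines = infile.readlines()
    chunks = split_into_chunks(lines, workers)
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, [(chunk, names) for chunk in chunks])
    return merge_counts(partials, names)
```

Call count_parallel("dataset.nt", mylist) from under an `if __name__ == "__main__":` guard, since multiprocessing may re-import the main module on some platforms.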

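import re

# Compile each pattern once and keep it paired with its list element
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]
# Every element starts at zero and is incremented on each matching line
results = dict.fromkeys(mylist, 0)

# Text mode, so that lines are str and match the str patterns on Python 3
with open("dataset.nt", "r") as infile:
    for line in infile:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1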

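Note that this could be better optimized if you posted a sample from your dataset. For example, if your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regex).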
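To illustrate the fixed-position case, here is a small sketch. It assumes "/Main/" always starts at the same character offset in every line of interest, which only your data can confirm; the function name count_at_fixed_offset and the offset parameter are hypothetical.

```python
def count_at_fixed_offset(lines, names, offset):
    """Count lines where "/Main/<name>" starts at a known offset."""
    prefix = "/Main/"
    start = offset + len(prefix)  # where the name itself begins
    counts = dict.fromkeys(names, 0)
    for line in lines:
        # Cheap positional check instead of a regex search
        if not line.startswith(prefix, offset):
            continue
        for name in names:
            # Same prefix semantics as the regex r"/Main/" + re.escape(name)
            if line.startswith(name, start):
                counts[name] += 1
    return counts
```

str.startswith accepts a start position, so no slicing (and no copy of the line) is needed for the comparison.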

Answer 3

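The other solutions are very good. But, since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.

Also, beyond a certain number of lines it's better to read the whole file at once (if you have enough memory and it isn't a design restriction).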

    import re

    mydict = {}
    mylist = [...] # A list with 500 items
    # Optimizing calls
    findall = re.findall  # Bind to local names so Python doesn't have to resolve these attributes on every call
    escape = re.escape

    # Text mode, so that the text is str and matches the str pattern on Python 3
    with open("dataset.nt", "r") as infile:
        # Read the file once and keep it in memory instead of reading it line by line. If the number of lines is big, this is faster.
        text = infile.read()
        for elem in mylist:
            # Count the lines that contain the target expression.
            # Note: a final line without a trailing newline is not counted.
            mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))

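I tested this with a file of size 800 MB (I wanted to see how long it takes to load a file that big into memory; it is faster than you would think).

I didn't test the whole code with real data, just the findall part.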

What is the most efficient way to loop over every line of a file?
