Question

I have users enter the items they want filtered out into a list. From there, it filters using:

while knownIssuesCounter != len(newLogFile):
    for line in knownIssues:
        if line in newLogFile[knownIssuesCounter]:
            if line not in issuesFound:
                issuesFoundCounter[line]=1
                issuesFound.append(line)
                issuesFound.append(knownIssues[line])
            else:
                issuesFoundCounter[line]=issuesFoundCounter[line] + 1
    knownIssuesCounter +=1

I've come across log files that are a million megabytes, and it's taking FOREVER... Is there a better way to do this in Python?

Answer 1

Try changing issuesFound from a list to a set:

issuesFound = set()

and use add instead of append:

issuesFound.add(line)
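To see why the container type matters, here is a rough timing sketch. The data is made up (100,000 hypothetical issue strings), and the exact numbers will vary by machine; the point is only that a list is scanned element by element, while a set uses a hash lookup.

```python
import timeit

# Hypothetical data: 100,000 distinct issue strings.
items = [f"issue {i}" for i in range(100_000)]
as_list = list(items)
as_set = set(items)
needle = "issue 99999"  # worst case for the list: the last element

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: needle in as_list, number=100)
set_time = timeit.timeit(lambda: needle in as_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Note that a set does not remember insertion order, so this only fits if the order of issuesFound does not matter to you.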

Answer 2

A large part of the reason your code is slow is if line not in issuesFound:. This requires a linear search through a huge list.

You can fix that by adding a set of the issues already seen (which can be searched effectively for free). That reduces the time from O(NM) to O(N).
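A minimal sketch of the seen-set idea, assuming made-up sample data. The name seen is introduced here for illustration and is not in the question's code; the issuesFound list is kept as is, so its order and contents stay the same as in the original loop.

```python
# Hypothetical sample data; the names mirror the question's variables.
knownIssues = {"disk full": "Free up space", "timeout": "Check the network"}
newLogFile = [
    "12:00 timeout while connecting",
    "12:01 disk full on /var",
    "12:02 timeout while connecting",
]

issuesFound = []         # same list as in the question, order preserved
seen = set()             # hypothetical helper: hashed membership test
issuesFoundCounter = {}

for newLogLine in newLogFile:
    for line in knownIssues:
        if line in newLogLine:
            if line not in seen:
                # First time this issue shows up: record it once.
                seen.add(line)
                issuesFoundCounter[line] = 1
                issuesFound.append(line)
                issuesFound.append(knownIssues[line])
            else:
                issuesFoundCounter[line] += 1

print(issuesFound)
print(issuesFoundCounter)
```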

But really, you can make this even simpler by removing the if entirely.

First, you can generate the issuesFound list after the fact from the keys of issuesFoundCounter. For each line in issuesFoundCounter, you want that line, then knownIssues[line]. So:

issuesFound = list(flatten((line, knownIssues[line]) for line in issuesFoundCounter))

(I'm using the flatten recipe from the itertools docs. You can copy that into your code, or you can just write this with itertools.chain.from_iterable instead of flatten.)
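For reference, here is a sketch of the same line written with itertools.chain.from_iterable directly, so no extra recipe is needed. The two dicts are made-up sample data standing in for the question's variables, and the resulting order relies on dicts keeping insertion order (Python 3.7 and later).

```python
from itertools import chain

# Hypothetical sample data standing in for the question's variables.
knownIssues = {"disk full": "Free up space", "timeout": "Check the network"}
issuesFoundCounter = {"timeout": 2, "disk full": 1}

# Each key becomes a (line, description) pair; chain.from_iterable
# flattens the pairs into one flat sequence.
issuesFound = list(chain.from_iterable(
    (line, knownIssues[line]) for line in issuesFoundCounter
))

print(issuesFound)
```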

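That means you can just test if line not in issuesFoundCounter: instead of in issuesFound:, and issuesFoundCounter is already a dict (so it can be searched effectively for free). But if you use setdefault, or, even simpler, a defaultdict or a Counter instead of a dict, you can make this automatic.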
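The three options can be compared side by side in a small sketch. The matches list is a hypothetical stand-in for the sequence of known issues found while scanning the log; all three approaches should end up with the same counts.

```python
from collections import Counter, defaultdict

# Hypothetical sequence of matched issues, in the order they were found.
matches = ["timeout", "disk full", "timeout"]

# Plain dict plus setdefault: the default is created on first sight.
counts_setdefault = {}
for line in matches:
    counts_setdefault.setdefault(line, 0)
    counts_setdefault[line] += 1

# defaultdict(int): missing keys start at 0 automatically.
counts_default = defaultdict(int)
for line in matches:
    counts_default[line] += 1

# Counter: counts an iterable directly.
counts_counter = Counter(matches)

print(counts_setdefault, dict(counts_default), dict(counts_counter))
```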

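So, if issuesFoundCounter is a Counter, the whole thing reduces to this:

for newLogLine in newLogFile:
    for line in knownIssues:
        if line in newLogLine:
            issuesFoundCounter[line] += 1

You can also turn that into a generator expression, replacing the slow explicit Python loop with faster looping inside the interpreter. This is only a fixed speedup of about 5:1, rather than the linear-to-constant speedup from the first half, but it is still worth considering:

issuesFoundCounter = collections.Counter(line for newLogLine in newLogFile for line in knownIssues if line in newLogLine)

The only problem is that the issuesFound list is now in arbitrary order, rather than in the order in which the issues were found. If that matters, just use an OrderedCounter instead of a Counter. There is a simple recipe in the collections docs, but for your case it can be as simple as this:

class OrderedCounter(collections.Counter, collections.OrderedDict):
    pass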
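Putting the Counter approach together, here is a self-contained sketch. It assumes made-up sample data, and io.StringIO stands in for a real open log file; iterating a file object yields one line at a time, so the whole log never has to be held in memory. On Python 3.7 and later, dicts keep insertion order and Counter is a dict subclass, so a plain Counter already iterates its keys in first-seen order there.

```python
import collections
import io

# Hypothetical sample data; the names mirror the question's variables.
knownIssues = {"disk full": "Free up space", "timeout": "Check the network"}

# Stand-in for open("some.log"); a real file object iterates the same way.
newLogFile = io.StringIO(
    "12:00 timeout while connecting\n"
    "12:01 disk full on /var\n"
    "12:02 timeout while connecting\n"
)

# Count every known issue that appears in each log line.
issuesFoundCounter = collections.Counter(
    line
    for newLogLine in newLogFile
    for line in knownIssues
    if line in newLogLine
)

# Keys in first-seen order (Python 3.7+), plus the most frequent issue.
foundInOrder = list(issuesFoundCounter)
mostCommon = issuesFoundCounter.most_common(1)

print(foundInOrder, mostCommon)
```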

Filtering huge Linux log files
