Question

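I have a text file containing tens of thousands of lines of ASCII text. I have a list of a few hundred keywords that I want to search for, considering each line individually. Initially, I want to return (print to screen or to a file) any line that has a match, but eventually I want to rank or sort the returned lines by their number of matches.

So, my list looks something like this...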

keywords = ['one', 'two', 'three']

My thinking was along these lines:

myfile = open('file.txt')
for line in myfile:
    if keywords in line:
        print line

But getting from pseudocode to working code just isn't happening.

I also thought about using a regex:

print re.findall(keywords, myfile.read())

But that led me down a path of different errors and problems.

If anyone can offer some guidance, syntax, or a code snippet, I would be grateful.

Answer 1

You can't test whether a list is in a string. What you can do is test whether one string is in another string.

lines = ['this is a line without any keywords', 
         'this is a line with one', 
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

for line in lines:
    for word in keywords:
        if word in line:
            print(line)
            break

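The break is needed to exit the inner loop over the keywords as soon as the first word matches. Otherwise, the line would be printed once for every keyword it contains.

The regex solution has the same problem. You can either use the same solution as above and add an extra loop over the words, or you can build a regex that matches any of the words. See the Python regex syntax documentation.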
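The nested loop with break can also be written with the built-in any(), which stops at the first keyword that matches. A minimal sketch, reusing the sample lines and keywords above (the name matching_lines is only illustrative):

```python
lines = ['this is a line without any keywords',
         'this is a line with one',
         'this is a line with one and two',
         'this is a line with three']
keywords = ['one', 'two', 'three']

# any() short-circuits on the first keyword found, like the break above.
# Note: this is a substring test, so 'one' would also match inside 'money'.
matching_lines = [line for line in lines
                  if any(word in line for word in keywords)]

for line in matching_lines:
    print(line)
```

Each matching line is printed once, no matter how many keywords it contains.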

import re

for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        print(line, len(matches))

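Note that re.findall returns an empty list if there are no matches and a list of all the matches if there are. So we can test the result directly in the if condition, because an empty list evaluates to False.

You can also easily generate the regex pattern for simple cases like these: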

pattern = '|'.join(keywords)
print(pattern)
# 'one|two|three'
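Joining the raw keywords works as long as they contain only letters. If a keyword might contain regex metacharacters, or if only whole words should count, one possible refinement is to escape each keyword and wrap the alternation in word boundaries. This goes beyond what the question asks for, so treat it as a sketch under that assumption:

```python
import re

keywords = ['one', 'two', 'three']

# re.escape guards against metacharacters such as '.' or '+' in a keyword;
# \b restricts matches to whole words.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b')

print(pattern.findall('one stone, two birds'))
# ['one', 'two']
```

Here the 'one' inside 'stone' is not counted, because no word boundary precedes it.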

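To sort them, you can simply put them in a list of tuples and use the key argument of sorted.

results = []
for line in lines:
    matches = re.findall('one|two|three', line)
    if matches:
        results.append((line, len(matches)))

results = sorted(results, key=lambda x: x[1], reverse=True)

You can read the documentation for sorted, but the key argument provides the function used for sorting. In this case, we extract the second element of each tuple, which is where we stored the number of matches in that line, and sort the list by that element.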

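You can apply this to an actual file and save the results by opening the files in with statements. You can read the documentation for with, but in this case it basically ensures that the file is closed once you are done with it.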
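As a rough sketch of how the counting and sorting could be applied to a file on disk: the function name rank_file and its parameters are hypothetical, and escaping the keywords with re.escape is an added precaution rather than part of the code above.

```python
import re

def rank_file(in_path, out_path, keywords):
    """Write the lines of in_path that contain a keyword to out_path,
    ordered by number of matches (highest first)."""
    pattern = re.compile('|'.join(re.escape(k) for k in keywords))
    results = []
    # The with statement closes each file when its block ends.
    with open(in_path) as infile:
        for line in infile:
            line = line.rstrip('\n')
            matches = pattern.findall(line)
            if matches:
                results.append((line, len(matches)))
    # Highest match count first; ties keep their order in the file.
    results.sort(key=lambda x: x[1], reverse=True)
    with open(out_path, 'w') as outfile:
        for line, count in results:
            outfile.write(line + '\n')
    return results
```

It could be called as rank_file('file.txt', 'results.txt', keywords), where file.txt is the name used in the question and results.txt is a made-up output name.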

Answer 2

Counter from the collections module seems like a good fit for this problem. I would do something like this.

from collections import Counter

keywords = ['one', 'two', 'three']
lines = ['without any keywords', 'with one', 'with one and two']

matches = []
for line in lines: 
    # Takes all the words in the line and gets the number of times 
    # they appear as a dictionary-like Counter object.
    words = Counter(line.split())

    line_matches = 0
    for kw in keywords:
        # Get the number of times it popped up in the line
        occurrences = words.get(kw, 0)
        line_matches += occurrences

    matches.append((line, line_matches))

# Sort by the number of occurrences per line, descending.
print(sorted(matches, key=lambda x: x[1], reverse=True))

Output:

[('with one and two', 2), ('with one', 1), ('without any keywords', 0)]
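One thing to keep in mind is that line.split() only splits on whitespace, so a token such as 'One,' would not be counted as 'one'. If the real file contains punctuation or mixed case, a possible variation is to normalize the words first. This is a sketch that assumes the keywords themselves are lowercase; the helper name count_keywords is made up for illustration:

```python
import re
from collections import Counter

keywords = ['one', 'two', 'three']

def count_keywords(line, keywords):
    # Lowercase the line and keep runs of word characters only, so that
    # 'One,' and 'one' are counted as the same token.
    words = Counter(re.findall(r'\w+', line.lower()))
    # A Counter returns 0 for missing keys, so no .get() is needed.
    return sum(words[kw] for kw in keywords)

print(count_keywords('One, two; and one more', keywords))
# 3
```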

Answer 3

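You didn't specify it in your question, but in my opinion a keyword that is found several times should only count once toward the score (this favors lines with more distinct keywords):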

def getmatching(lines, keywords):
    result = []
    keywords = set(keywords)
    for line in lines:
        matches = len(keywords & set(line.split()))
        if matches:
            result.append((matches, line))
    return (line for matches, line in sorted(result, reverse=True))

Example

lines = ['no keywords here', 'one keyword here',
         'two keywords in this one line', 'three minus two equals one',
         'one counts only one time because it is only one keyword']

keywords = ['one', 'two', 'three']

for line in getmatching(lines, keywords):
    print(line)

Output

three minus two equals one
two keywords in this one line
one keyword here
one counts only one time because it is only one keyword
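Because getmatching sorts whole (matches, line) tuples, lines with the same score end up in reverse alphabetical order. If ties should instead keep the order in which they appear in the file, one option is to sort on the count alone, since sorted() is stable. A sketch of that variation, with the hypothetical name getmatching_stable:

```python
def getmatching_stable(lines, keywords):
    keywords = set(keywords)
    # Score each line by its number of distinct keywords.
    scored = [(len(keywords & set(line.split())), line) for line in lines]
    # Sort on the count only; sorted() is stable, so ties keep input order.
    ranked = sorted((item for item in scored if item[0]),
                    key=lambda item: item[0], reverse=True)
    return [line for count, line in ranked]

print(getmatching_stable(['a one', 'b one', 'one two'], ['one', 'two', 'three']))
# ['one two', 'a one', 'b one']
```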

Searching lines of a file for list entries with Python
