Question

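I have a question that could help simplify my program. I have a file, text.txt, that I want to go through and compare against a list of words, words. Every time one of the words is found, a 1 is added to a list of integers.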

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)

I am just wondering whether there is a faster way to do this. The text file could get larger, and so could the list of words.

Answer 1

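You can convert words to a set so that lookups are faster. This should give your program a good performance boost: looking up a value in a list means walking the list one element at a time (O(n) time complexity), whereas a lookup in a set takes O(1) time on average (constant time), because sets use hashing to find elements.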

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
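To see the difference for yourself, here is a minimal timing sketch. The container size, the probe word, and the helper name time_lookup are assumptions chosen only for illustration, and the numbers will vary from machine to machine.

```python
import timeit

# Hypothetical data: 10,000 distinct words, stored once as a list and once as a set.
word_list = [f"word{i}" for i in range(10_000)]
word_set = set(word_list)

# A probe near the end of the list, which is close to the worst case for a list scan.
probe = "word9999"

def time_lookup(container, repeats=1_000):
    """Return the seconds taken by `repeats` membership tests of `probe`."""
    return timeit.timeit(lambda: probe in container, number=repeats)

list_seconds = time_lookup(word_list)
set_seconds = time_lookup(word_set)
print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

Both containers give the same answer to probe in container; only the time taken to reach it differs.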

Then, you can use the sum function to count the matches:

with open('text.txt') as file:
    print(sum(part in words for line in file for part in line.split()))

Booleans and their integer equivalents

In Python, the boolean values False and True are equal to 0 and 1, respectively.

>>> True == 1
True
>>> False == 0
True
>>> int(True)
1
>>> int(False)
0
>>> sum([True, True, True])
3
>>> sum([True, False, True])
2

So each time you check part in words, the result is either 0 or 1, and we sum all of those values.

The code above is functionally equivalent to

result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1
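One caveat: the loop in the question tests word in part, which is a substring test, while the set lookup part in words only matches whole tokens. The sketch below uses a made-up sample string in place of text.txt (an assumption for illustration) to show that the two counts can differ.

```python
# Made-up sample text standing in for the contents of text.txt.
sample = "other theory and handy"

word_list = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
word_set = set(word_list)

# Substring matching, as in the question: "the" is found inside "other" and "theory".
substring_hits = sum(
    1
    for part in sample.split()
    for word in word_list
    if word in part
)

# Whole-token matching, as in the set-based version: only "and" is an exact match.
token_hits = sum(part in word_set for part in sample.split())

print(substring_hits, token_hits)
```

If substring matches are what you actually need, a set lookup alone will not reproduce them, so decide which behaviour you want before switching.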

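Note: If you really want a list with a 1 for every match, you can turn the generator expression passed to sum into a list comprehension, like this (the list holds a 1 for every match and a 0 for every other token)

with open('text.txt') as file:
    print([int(part in words) for line in file for part in line.split()])

Frequency of words

If you actually want to find the frequency of the individual words in words, you can use collections.Counter like this

from collections import Counter
with open('text.txt') as file:
    c = Counter(part for line in file for part in line.split() if part in words)

This internally counts the number of times each of the words in words occurs in the file.

As per the comment, you can have a dictionary that stores positive words with positive scores and negative words with negative scores, and compute the total like this

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}
with open('text.txt') as file:
    print(sum(words.get(part, 0) for line in file for part in line.split()))

Here, we use the dictionary's words.get method to fetch the score stored for each word; if the word is not found in the dictionary (it is neither a good word nor a bad word), the default value 0 is returned.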
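As a small self-contained sketch of counting word frequencies with Counter and summing scores with dict.get, the following runs on an in-memory string instead of a real file. The sample text, the names targets, scores, and SAMPLE, and the tokens helper are assumptions made for illustration.

```python
import io
from collections import Counter

# Made-up sample text standing in for text.txt.
SAMPLE = "it is good and great\nno one can help it\n"

targets = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
scores = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

def tokens(file):
    """Yield every whitespace-separated token from a file-like object."""
    for line in file:
        yield from line.split()

# Frequency of each target word that appears in the text.
freq = Counter(t for t in tokens(io.StringIO(SAMPLE)) if t in targets)

# Net score: known words contribute their score, everything else contributes 0.
total = sum(scores.get(t, 0) for t in tokens(io.StringIO(SAMPLE)))

print(freq)
print(total)
```

io.StringIO iterates line by line just like an open text file, so the same expressions should work unchanged on the object returned by open('text.txt').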

Answer 2

You can use set.intersection to find the intersection between a set and a list, so a more efficient approach is to put your words in a set and do:

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
ints = []
with open('text.txt') as f:
    for line in f:
        for _ in range(len(words.intersection(line.split()))):
            ints.append(1)

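Note that the solution above mirrors your code, which appends 1 to a list. If what you want is the final count, you can use a generator expression within sum. Be aware that the intersection counts each distinct word at most once per line, so a word repeated on the same line is not counted again.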

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
with open('text.txt') as f:
    # Print the total; a bare sum(...) expression would discard the result.
    print(sum(len(words.intersection(line.split())) for line in f))
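To illustrate how the intersection-based count relates to a per-token count, here is a minimal sketch on two made-up lines (an assumption, used in place of text.txt). A set intersection keeps each distinct word once, so a word that repeats within a line is counted only once for that line.

```python
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Made-up lines standing in for the lines of text.txt.
lines = ["the cat and the dog", "one or two"]

# Intersection per line: "the" appears twice in the first line but counts once.
distinct_per_line = sum(len(words.intersection(line.split())) for line in lines)

# Per-token membership test: every occurrence counts.
every_occurrence = sum(part in words for line in lines for part in line.split())

print(distinct_per_line, every_occurrence)
```

Which of the two numbers is the right one depends on whether repeated words on a line should count more than once.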
