Question

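I have a lot of sentences (a little over 100,000), each containing about 10 words on average. I am trying to put them all into one big list so that I can use Counter from the collections library to show me how often each word occurs. This is what I am currently doing: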

from collections import Counter
words = []
for sentence in sentenceList:
    words = words + sentence.split()
counts = Counter(words)

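I was wondering whether there is a more efficient way to do the same thing. I have been waiting almost an hour for this code to finish executing. I think the concatenation is what takes so long, because if I replace the line words = words + sentence.split() with print(sentence.split()), it finishes within seconds. Any help would be much appreciated.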

Answer 1

If all you want is to count the elements, don't build one huge, memory-hungry list. Keep updating a Counter object with each new iterable instead:

from collections import Counter
counts = Counter()
for sentence in sentenceList:
    counts.update(sentence.split())
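
To make the incremental approach above concrete, here is a minimal self-contained sketch. The two sample sentences are hypothetical stand-ins for the asker's sentenceList, which is not shown in the question.

```python
from collections import Counter

# Hypothetical sample data standing in for the asker's sentenceList.
sentenceList = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

counts = Counter()
for sentence in sentenceList:
    # update() adds the counts of the new words to the running totals,
    # so no combined list of all words is ever built.
    counts.update(sentence.split())

# The two most frequent words with their counts.
top_two = counts.most_common(2)
print(top_two)
```

Memory use here depends on the number of distinct words rather than the total number of words, which is usually much smaller for natural-language text.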

Answer 2

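You can use extend, which appends to the existing list in place, whereas words = words + sentence.split() builds a new list and copies every element on each iteration: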

from collections import Counter
words = []
for sentence in sentenceList:
    words.extend(sentence.split())
counts = Counter(words)
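
The difference between the two forms can be checked directly. The following is a small sketch with made-up sample strings: it keeps a second name for the list and tests whether that name still refers to the same object after each operation.

```python
words = []
alias = words  # second name for the same list object

# extend() mutates the existing list in place.
words.extend("a b".split())
same_after_extend = alias is words

# "+" builds a brand-new list, copying every existing element into it.
words = words + "c d".split()
same_after_concat = alias is words

print(same_after_extend, same_after_concat)
```

Because each concatenation copies everything accumulated so far, the total work in the original loop grows roughly with the square of the number of words, while extend only does work proportional to the words being added.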

Or a list comprehension, like this:

words = [word for sentence in sentenceList for word in sentence.split()]

If you don't need words afterwards, you can pass a generator to Counter:

counts = Counter(word for sentence in sentenceList for word in sentence.split())
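
As a rough way to compare the variants, the sketch below times repeated concatenation, extend, and the generator on synthetic data. The data, the function names, and the input size are all assumptions made for illustration; the input is kept small so the slow variant finishes quickly, and actual timings will vary by machine.

```python
import time
from collections import Counter

# Synthetic stand-in for sentenceList (2,000 short sentences).
sentenceList = ["w%d x y z" % (i % 50) for i in range(2000)]

def count_by_concatenation(sentences):
    words = []
    for sentence in sentences:
        words = words + sentence.split()  # copies the whole list each time
    return Counter(words)

def count_by_extend(sentences):
    words = []
    for sentence in sentences:
        words.extend(sentence.split())  # appends in place
    return Counter(words)

def count_by_generator(sentences):
    # No intermediate list at all; Counter consumes words one by one.
    return Counter(word for sentence in sentences for word in sentence.split())

results = {}
for fn in (count_by_concatenation, count_by_extend, count_by_generator):
    start = time.perf_counter()
    results[fn.__name__] = fn(sentenceList)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed:.4f}s")
```

All three functions should produce identical counts; only the running time and memory use differ, and the gap for the concatenation variant widens quickly as the input grows.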

A more efficient way to concatenate a large number of lists?
