Counting strings in a nested list

Question

I have a list of lists, as shown below.

sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], 
    ["my", "favorite", "language", "is", "python"]
]

I want to get the count of each word in the sentences list. So my output should look like this.

{
    'stackoverflow': 1,
     'question': 1,
     'is': 2,
     'language': 1,
     'first': 1,
     'in': 1,
     'favorite': 2,
     'python': 1,
     'my': 3
}

I am currently doing the following.

frequency_input = [item for sublist in sentences for item in sublist]
frequency_output = dict(
    (x,frequency_input.count(x)) 
    for x in set(frequency_input)
)

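However, this is not efficient at all for long lists. My list is really long, with about a million sentences. It has been running for two days and is still going.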

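In this situation I would like to make my program more efficient. In my current code, the first statement (the flattening) is O(n), while the second is O(n^2) in the worst case, because count() rescans the whole list for every distinct word. Please let me know if there is a more efficient way to do this in Python. Ideally it would run in much less time than it does now. I am not worried about space complexity.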

I am happy to provide more details if needed.

Answer 1

A simpler and more efficient approach is to use itertools.chain to flatten the list and collections.Counter to count the strings:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(sentences))

Counter({'my': 3,
         'first': 1,
         'question': 1,
         'in': 1,
         'stackoverflow': 1,
         'is': 2,
         'favorite': 2,
         'language': 1,
         'python': 1})
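
If a plain dict in the shape shown in the question is needed, the Counter can be converted directly. The following is a minimal sketch of that, not part of the original answer; the names frequency_output and top_word are chosen here for illustration.

```python
from collections import Counter
from itertools import chain

sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"],
    ["my", "favorite", "language", "is", "python"],
]

# chain.from_iterable yields the words lazily, so no flattened list is built.
word_counts = Counter(chain.from_iterable(sentences))

# Counter is a dict subclass, so dict() gives a plain dictionary of the counts.
frequency_output = dict(word_counts)

# most_common(k) returns the k highest counts as (word, count) pairs.
top_word = word_counts.most_common(1)
print(frequency_output["my"], top_word)  # 3 [('my', 3)]
```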

Answer 2

You can use the Counter class from the collections module.

If you want the word counts for each sentence separately, you can do the following:

from collections import Counter

sentences = [["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], ["my", "favorite", "language", "is", "python"]]

counter_list = [dict(Counter(sentence)) for sentence in sentences]
print(counter_list)

Output:

[{'my': 2, 'first': 1, 'question': 1, 'in': 1, 'stackoverflow': 1, 'is': 1, 'favorite': 1}, {'my': 1, 'favorite': 1, 'language': 1, 'is': 1, 'python': 1}]

Or, if you want the total word counts, you can use chain from the itertools module.

import itertools
from collections import Counter

sentences = [["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], ["my", "favorite", "language", "is", "python"]]

sentences = list(itertools.chain.from_iterable(sentences))
word_counts = Counter(sentences)
print(word_counts)

Output:

Counter({'my': 3, 'is': 2, 'favorite': 2, 'first': 1, 'question': 1, 'in': 1, 'stackoverflow': 1, 'language': 1, 'python': 1})

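As for the complexity of Counter: as the documentation states, Counter is a dict subclass for counting hashable objects. Constructing a Counter from an iterable of n items therefore takes O(n) time.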
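
To see why the count is linear, it can help to write the same idea out by hand. This is only an illustrative sketch of a single-pass count, not how Counter is implemented internally; count_words is a hypothetical helper name.

```python
def count_words(sentences):
    """Count every word in a list of word lists in a single pass."""
    counts = {}
    for sentence in sentences:
        for word in sentence:
            # A dict lookup plus assignment is O(1) on average, so each
            # word is handled once and the whole loop is O(n).
            counts[word] = counts.get(word, 0) + 1
    return counts


sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"],
    ["my", "favorite", "language", "is", "python"],
]
print(count_words(sentences)["my"])  # 3
```

By contrast, calling list.count once per distinct word rescans the whole list each time, which is what makes the code in the question slow.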

Answer 3

sentences = [["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], ["my", "favorite", "language", "is", "python"]]

combinedList = []

# Merge the list of word arrays into one array
def my_function(my_list):
    for sublist in my_list:
        combinedList.extend(sublist)
    print(combinedList)

my_function(sentences)

# Use the count function on the word array
countData = {}

for word in combinedList:
    countData[word] = combinedList.count(word)

countData will then hold the count of each word.
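
Since this answer still calls count() once per word, it is likely to be as slow as the code in the question on a million sentences. One possible variant, sketched here under the assumption that the sentences can be consumed one at a time, skips the combined list entirely and feeds each sentence to Counter.update. The function name count_streaming is hypothetical.

```python
from collections import Counter


def count_streaming(sentence_iter):
    """Accumulate word counts one sentence at a time."""
    totals = Counter()
    for sentence in sentence_iter:
        # update() adds one to the count of every element in the iterable.
        totals.update(sentence)
    return totals


sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"],
    ["my", "favorite", "language", "is", "python"],
]

# Any iterable of word lists works, including a generator.
countData = dict(count_streaming(s for s in sentences))
print(countData["my"], countData["is"])  # 3 2
```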