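Find the most common words in a list of dictionaries in Python

Time: 2017-09-19 20:53:08

Tags: python dictionary count

I would like to know how to get the most common words from a list of dictionaries. An example of the structure is shown below.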

listDict = [{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling, relaxed developer sip lattes and calmly discuss how Flex is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
{'longDescription': 'Is going to make customers happy and shorten the workday.'},
{'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'}]

The desired result is a list like the one below, ordered from the most common word to the least common:

[('word1', 7), 
('word2', 7), 
('word3', 3), 
('word4', 3), 
('word5', 3), 
('word6', 2), 
('word7', 2)]

2 answers:

Answer 0 (score: 5)

Here is a fun approach: build a Counter for each description, then use sum to add the individual Counters together.

from collections import Counter
import re

# A raw string (r'...') keeps \W from being treated as an invalid string escape.
counts = sum((Counter(filter(None, re.split(r'\W+', v.lower())))
                    for x in listDict for v in x.values()), Counter())

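print(counts.most_common(5))
# Output as reported in the answer; which tied words appear, and in what order, can differ between Python versions.
[('a', 8), ('and', 5), ('the', 5), ('marketer', 3), ('designer', 3)]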
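
To see why sum works here: adding two Counter objects gives a new Counter with the counts combined per key, and sum needs an empty Counter() as its start value because its default start is the integer 0. A minimal sketch with two made-up word lists (the names first, second, combined and total are illustrative, not from the answer):

```python
from collections import Counter

# Two small, made-up word lists used only for illustration.
first = Counter(["a", "hip", "designer", "a"])
second = Counter(["a", "marketer"])

# Adding Counters merges them by summing the counts for each key.
combined = first + second

# sum() starts from 0 by default, so an empty Counter is passed as the start value.
total = sum([first, second], Counter())

print(combined)
print(total == combined)
```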

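Regex details

\W+   # one or more non-word characters (anything other than letters, digits, or underscore)

re.split splits the text according to the regex pattern. filter removes the empty strings (credit to Ajax1234 for that part).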
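
As a small illustration of why the filter step is needed, the sketch below splits a single sentence taken from the question. The text ends with a period, so re.split leaves an empty string at the end of the result. The variable names text, pieces and words are illustrative only.

```python
import re

text = "In the demo, a hip designer, a sharply-dressed marketer, and a smiling."

# Splitting on runs of non-word characters leaves a trailing empty string
# because the text ends with punctuation.
pieces = re.split(r"\W+", text.lower())

# filter(None, ...) drops falsy items, which here means the empty strings.
words = list(filter(None, pieces))

print(pieces[-1] == "")
print(words)
```

Note that the hyphen also counts as a non-word character, so 'sharply-dressed' is counted as the two words 'sharply' and 'dressed'.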

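Answer 1 (score: 1)

If it is reasonable to expect every dictionary in the list to have the same key (for example, 'longDescription' in the example you provided), only a few steps are needed. While looping over each item in the list, you need to clean the string (str.lower()), split the string into words (str.split()), and then add each word to a word-count dictionary. Fortunately, each of these steps can be done with functionality built into Python.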

from collections import defaultdict

# A defaultdict is convenient here: when a missing key is accessed, the key
# is added to the dictionary with a default value. Because the default
# factory is int, that default value is 0.
word_count = defaultdict(int)
for dictionary in listDict:
    clean_str = dictionary['longDescription'].lower()
    # Splitting on spaces leaves punctuation attached to words (e.g. 'demo,').
    words = clean_str.split(' ')
    for word in words:
        word_count[word] += 1
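
The loop above fills a dictionary, while the question asks for a list of (word, count) pairs ordered by frequency. One possible way to get there, sketched with a shortened copy of the question's data, is to sort the items by count. Trimming punctuation with str.strip is an extra step assumed here and is not part of the answer; without it, splitting on whitespace would count 'designer,' and 'designer' as different words. The name ranked is illustrative.

```python
from collections import defaultdict
import string

# Shortened copy of the question's data, for illustration.
listDict = [
    {'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer'},
    {'longDescription': 'In the demo, a hip designer, a sharply-dressed marketer, and a smiling.'},
]

word_count = defaultdict(int)
for dictionary in listDict:
    for word in dictionary['longDescription'].lower().split():
        # Assumption: trim leading/trailing punctuation so 'designer,' counts as 'designer'.
        word = word.strip(string.punctuation)
        if word:
            word_count[word] += 1

# Sort by count, highest first; sorted() is stable, so ties keep first-seen order.
ranked = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
print(ranked[:3])
```

Unlike the regex split in the first answer, this keeps 'sharply-dressed' as a single word, because only punctuation at the ends of each token is removed.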