Question

As the title says, I need to write code that returns a list of the 3 most frequent words in an input string. This is what I have so far:

IN:

import collections

print(sstr)

OUT:

['22574999', 'communication was sent']
['22582857', 'message originated from an industrial area in pacoima']
['22585166', 'your message will never be delivered']
['22585424', 'message has been delivered ']

IN:

import collections

id = sstr[0]
info = (sstr[1]).split()
print(id,info)

OUT:

22574999 ['communication', 'was', 'sent']
22582857 ['message', 'originated', 'from', 'an', 'industrial', 'area', 'in', 'pacoima']
22585166 ['your', 'message', 'will', 'never', 'be', 'delivered']
22585424 ['message', 'has', 'been', 'delivered']

IN:

import collections

id = sstr[0]
info = (sstr[1]).split()
c = collections.Counter()

for word in info:
    c[word] += 1

print(c.most_common(3))

OUT:

Counter({'communication': 1, 'was': 1, 'sent': 1})
Counter({'message': 1, 'originated': 1, 'from': 1, 'an': 1, 'industrial': 1, 'area': 1, 'in': 1, 'pacoima': 1})
Counter({'your': 1, 'message': 1, 'will': 1, 'never': 1, 'be': 1, 'delivered': 1})
Counter({'message': 1, 'has': 1, 'been': 1, 'delivered': 1})

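I want to combine all the rows into one and find the top 3 words with the highest frequency. How do I also find, for each of those top 3 words, the ids of the rows that contain it?

I would like to get the following result.

Result: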

top 3 words with highest frequency:

message :3 
delivered:2    
communication:1

sum of id in which there are top 3 words with highest frequency:

message:3       Is included (22582857,22585166,22585424 )     
delivered:2     Is included(22585166,22585424)
communication:1 Is included (22574999)

Answer 1

from collections import Counter, defaultdict

messages = [
    ['364616', 'baa baa black sheep'],
    ['364617', 'have you any wool'],
    ['364618', 'yes sir yes sir'],
    ['364619', 'three bags full'],
    ['364620', 'one for the master'],
    ['364621', 'and one for the dame'],
    ['364622', 'and one for the little boy'],
    ['364623', 'who lives down the lane']]

word_counts = Counter()
word_to_msgids = defaultdict(set)

for msgid, msg in messages:
    for word in msg.split(): # use set(msg.split()) to drop duplicates
        word_counts[word] += 1
        word_to_msgids[word].add(msgid)

# Report the 8 most frequent words and the messages each one appears in
for word, count in word_counts.most_common(8):
    msgids = ', '.join(word_to_msgids[word])
    print('"{}" appears {} times in messages {}'.format(word, count, msgids))

Output

"the" appears 4 times in messages 364621, 364620, 364623, 364622
"one" appears 3 times in messages 364621, 364620, 364622
"for" appears 3 times in messages 364621, 364620, 364622
"and" appears 2 times in messages 364621, 364622
"yes" appears 2 times in messages 364618
"sir" appears 2 times in messages 364618
"baa" appears 2 times in messages 364616
"down" appears 1 times in messages 364623
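
The same approach can be applied to the sample rows from the question. The following is a minimal sketch, assuming Python 3 and that each row is an [id, text] pair; the names rows, top_words and ids_by_word are made up for illustration and do not come from the original code. It uses a list instead of a set for the ids so that they stay in the order the rows were read.

```python
from collections import Counter, defaultdict

# Sample rows copied from the question: [message id, message text]
rows = [
    ['22574999', 'communication was sent'],
    ['22582857', 'message originated from an industrial area in pacoima'],
    ['22585166', 'your message will never be delivered'],
    ['22585424', 'message has been delivered '],
]

def top_words(rows, n=3):
    """Return (word, count, ids) for the n most frequent words across all rows."""
    counts = Counter()
    ids_by_word = defaultdict(list)  # word -> ids of rows containing it, in input order
    for msgid, text in rows:
        for word in text.split():
            counts[word] += 1
            if msgid not in ids_by_word[word]:
                ids_by_word[word].append(msgid)
    return [(word, count, ids_by_word[word])
            for word, count in counts.most_common(n)]

result = top_words(rows)
for word, count, ids in result:
    print('{}:{} is included ({})'.format(word, count, ', '.join(ids)))
```

Many words in this sample occur exactly once, so which of them takes third place depends on how most_common orders ties. If a specific tie-breaking rule matters, sort the counts explicitly instead of relying on it.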

Note: I don't think you need a separate word count for each message. If you really do need it:

msgid_to_word_counts = {msgid:Counter(s.split()) for msgid, s in messages}

If you want 'baa' in 'baa baa black sheep' to be counted once instead of twice, use set to remove the duplicates from the result of split():

msgid_to_word_counts = {msgid:Counter(set(s.split())) for msgid, s in messages}
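
To make the difference between the two variants concrete, here is a small sketch, assuming Python 3; the variable names text, per_occurrence and per_message are illustrative only.

```python
from collections import Counter

text = 'baa baa black sheep'

# Counts every occurrence of a word within the message
per_occurrence = Counter(text.split())

# Counts each distinct word at most once per message
per_message = Counter(set(text.split()))

print(per_occurrence['baa'])  # 2
print(per_message['baa'])     # 1
```

Which variant fits depends on whether "frequency" should mean total occurrences or the number of messages containing the word.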

Python - Top 3 words with the highest frequency