As the title says, I need to write code that returns a list of the 3 most frequent words (from an input string). This is what I have so far:
IN:
import collections
print(sstr)
OUT:
['22574999', 'communication was sent']
['22582857', 'message originated from an industrial area in pacoima']
['22585166', 'your message will never be delivered']
['22585424', 'message has been delivered ']
IN:
import collections
id = sstr[0]
info = (sstr[1]).split()
print(id,info)
OUT:
22574999 ['communication', 'was', 'sent']
22582857 ['message', 'originated', 'from', 'an', 'industrial', 'area', 'in', 'pacoima']
22585166 ['your', 'message', 'will', 'never', 'be', 'delivered']
22585424 ['message', 'has', 'been', 'delivered']
IN:
import collections
id = sstr[0]
info = (sstr[1]).split()
c = collections.Counter()
for word in info:
    c[word] += 1
print(c.most_common(3))
OUT:
Counter({'communication': 1, 'was': 1, 'sent': 1})
Counter({'message': 1, 'originated': 1, 'from': 1, 'an': 1, 'industrial': 1, 'area': 1, 'in': 1, 'pacoima': 1})
Counter({'your': 1, 'message': 1, 'will': 1, 'never': 1, 'be': 1, 'delivered': 1})
Counter({'message': 1, 'has': 1, 'been': 1, 'delivered': 1})
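I want to merge all the rows into one and find the 3 most frequent words. How can I find the "sum of ids" for those top 3 words, i.e. the ids of the rows in which each of them appears?
I would like to get the following result
Result: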
top 3 words with highest frequency:
message :3
delivered:2
communication:1
sum of id in which there are top 3 words with highest frequency:
message:3 Is included (22582857,22585166,22585424 )
delivered:2 Is included(22585166,22585424)
communication:1 Is included (22574999)
Answer 0 (score: 0)
from collections import Counter, defaultdict

messages = [
    ['364616', 'baa baa black sheep'],
    ['364617', 'have you any wool'],
    ['364618', 'yes sir yes sir'],
    ['364619', 'three bags full'],
    ['364620', 'one for the master'],
    ['364621', 'and one for the dame'],
    ['364622', 'and one for the little boy'],
    ['364623', 'who lives down the lane']]

word_counts = Counter()            # word -> total occurrences across all messages
word_to_msgids = defaultdict(set)  # word -> ids of the messages that contain it

for msgid, msg in messages:
    for word in msg.split():  # use set(msg.split()) to drop duplicates
        word_counts[word] += 1
        word_to_msgids[word].add(msgid)

for word, count in word_counts.most_common(8):
    # Set iteration order is arbitrary, so the ids may print in any order.
    msgids = ', '.join(word_to_msgids[word])
    print('"{}" appears {} times in messages {}'.format(word, count, msgids))
Output
"the" appears 4 times in messages 364621, 364620, 364623, 364622
"one" appears 3 times in messages 364621, 364620, 364622
"for" appears 3 times in messages 364621, 364620, 364622
"and" appears 2 times in messages 364621, 364622
"yes" appears 2 times in messages 364618
"sir" appears 2 times in messages 364618
"baa" appears 2 times in messages 364616
"down" appears 1 times in messages 364623
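Applied to the rows from the question, the same idea might look like the sketch below. The helper name top_words_with_ids and the variable rows are made up for illustration, and the ids are sorted only to make the output stable. Words with equal counts come out in first-seen order on Python 3.7+; on older versions the order among ties is not guaranteed.

```python
from collections import Counter, defaultdict

# Rows copied from the question: [message id, message text].
rows = [
    ['22574999', 'communication was sent'],
    ['22582857', 'message originated from an industrial area in pacoima'],
    ['22585166', 'your message will never be delivered'],
    ['22585424', 'message has been delivered '],
]


def top_words_with_ids(rows, n=3):
    """Return (word, count, sorted ids) for the n most frequent words."""
    counts = Counter()
    ids_by_word = defaultdict(set)
    for msgid, text in rows:
        for word in text.split():
            counts[word] += 1
            ids_by_word[word].add(msgid)
    return [(word, count, sorted(ids_by_word[word]))
            for word, count in counts.most_common(n)]


for word, count, ids in top_words_with_ids(rows):
    print('{}:{} is included in ({})'.format(word, count, ', '.join(ids)))
```

Here 'message' (3) and 'delivered' (2) are unambiguous; the third slot is a tie among all the words that occur once, so which one is reported depends on the tie-breaking described above.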
Note: I don't think you need a separate count of the words in each message. If you really do need it:
msgid_to_word_counts = {msgid:Counter(s.split()) for msgid, s in messages}
If you want 'baa' to be counted once rather than twice in 'baa baa black sheep', use set to remove the duplicates from split():
msgid_to_word_counts = {msgid:Counter(set(s.split())) for msgid, s in messages}
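To make the difference between the two variants concrete, here is a small sketch on two of the messages above. The names per_message and per_message_unique are only for illustration.

```python
from collections import Counter

messages = [
    ['364616', 'baa baa black sheep'],
    ['364618', 'yes sir yes sir'],
]

# Every occurrence is counted.
per_message = {msgid: Counter(s.split()) for msgid, s in messages}
# Each distinct word is counted at most once per message.
per_message_unique = {msgid: Counter(set(s.split())) for msgid, s in messages}

print(per_message['364616']['baa'])         # 2
print(per_message_unique['364616']['baa'])  # 1
```

With set, each per-message Counter only ever holds 1 for a word, so it is mainly useful if you later add the counters together to get "number of messages containing the word".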