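I want my Python program to output a list of the ten most frequently used words along with their counts. I have to use mrjob (MapReduce) to write this program. I have written a program that finds the frequency of each word and outputs the words ordered by frequency, but I don't know how to output only the ten most frequent words. I thought I could put the results in a list and use a second map/reduce step to sort them, but I don't know how to do that with MapReduce. I am new to programming with MapReduce and Python. Can anyone give me some advice?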
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Word frequency from book sorted by frequency
# File: book.txt

# Regular expression used to identify a word
WORD_REGEXP = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        # 2 steps
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    # Step 1
    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for w in words:
            yield w.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    # Step 2
    def mapper_make_counts_key(self, word, count):
        # Sort by values: the count becomes the key, zero-padded to 4 digits
        # so that string ordering matches numeric ordering (up to 9999)
        yield '%04d' % int(count), word

    def reducer_output_words(self, count, words):
        # First column is the count
        # Second column is the word
        for word in words:
            yield count, word


if __name__ == '__main__':
    MRWordFrequencyCount.run()
Answer 0 (score: 0)
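Your result is an unordered collection of key-value pairs. One solution is to convert it to a list of tuples: each tuple still keeps a word associated with its count, and a list has positions, so it can be sorted. See https://docs.python.org/2/howto/sorting.html#sort-stability-and-complex-sorts. Once the list is sorted by count in descending order, you can slice off the first 10 entries to get the most common words.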
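As a rough sketch of the sort-and-slice idea, assuming the word counts are already available as a dict: the names `top_words`, `mapper_to_single_key`, `reducer_top_n` and `TOP_N` are made up for illustration and are not part of mrjob. The second half shows how the same idea might be shaped as a final mapper/reducer pair, written here as plain generator functions so it runs without mrjob; sending every pair to a single key is one possible way to let one reducer see all the counts, not necessarily the only or best one.

```python
from operator import itemgetter

TOP_N = 10  # hypothetical constant: how many words to keep


def top_words(word_counts, n=TOP_N):
    """Return the n (word, count) pairs with the highest counts."""
    # Convert the unordered mapping to a list of tuples so it can be sorted.
    pairs = list(word_counts.items())
    # list.sort() is stable, so words with equal counts keep their input order.
    pairs.sort(key=itemgetter(1), reverse=True)
    # Slice off the first n entries.
    return pairs[:n]


# The same idea shaped like mrjob step functions. These are plain
# generators (an assumption for illustration), not methods of an MRJob.
def mapper_to_single_key(word, count):
    # Emit every pair under one key so a single reducer receives them all.
    yield None, (count, word)


def reducer_top_n(_, count_word_pairs, n=TOP_N):
    # Sort (count, word) pairs from highest to lowest count, then slice.
    ranked = sorted(count_word_pairs, reverse=True)
    for count, word in ranked[:n]:
        yield count, word
```

For example, `top_words({"the": 5, "a": 3, "cat": 1}, n=2)` keeps only the two most frequent words. Sorting everything in one reducer is fine for a single book, but it gives up parallelism for that last step.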