MapReduce for word frequency in Python

Time: 2017-10-23 23:39:01

Tags: python hadoop mapreduce mrjob

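I want my Python program to output a list of the ten most frequently used words along with their counts. I have to use mrjob (MapReduce) to write this program. I have written a program that finds the frequency of each word and outputs the words from most to least frequent, but I don't know how to output only the top ten. I thought I could perhaps put the results in a list and sort them with a second map-reduce step, but I don't know how to do that with MapReduce. I am new to programming with MapReduce and Python. Can anyone give me some advice?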

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Word frequency from book sorted by frequency
# File: book.txt  

# regular expression used to identify word
WORD_REGEXP = re.compile(r"[\w']+")

class MRWordFrequencyCount(MRJob):

    def steps(self):
        # 2 steps
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(mapper=self.mapper_make_counts_key,
                   reducer=self.reducer_output_words)
        ]

    # Step 1
    def mapper_get_words(self, _, line):
        words = WORD_REGEXP.findall(line)
        for w in words:
            yield w.lower(), 1

    def reducer_count_words(self, word, values):
        yield word, sum(values)

    # Step 2
    def mapper_make_counts_key(self, word, count):
        # sort by values
        yield '%04d' % int(count), word

    def reducer_output_words(self, count, words):
        # First Column is the count
        # Second Column is the word
        for word in words:
            yield count, word


if __name__ == '__main__':
    MRWordFrequencyCount.run()

1 answer:

Answer 0 (score: 0)

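Your result is an unordered collection of key-value pairs. One solution is to convert it to a list of tuples: each word stays associated with its count, and the list gives you an ordering you can sort on. See https://docs.python.org/2/howto/sorting.html#sort-stability-and-complex-sorts. You can then slice off the 10 most frequent entries.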
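As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes the word counts have already been collected into a dictionary. The helper name `top_n_words` and the sample data are hypothetical, not part of the question's code or of mrjob.

```python
def top_n_words(word_counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    # Convert the unordered mapping into a list of (word, count) tuples.
    pairs = list(word_counts.items())
    # Sort by count, descending; break ties alphabetically so the
    # result is deterministic.
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    # Slice off the first n entries.
    return pairs[:n]


# Hypothetical sample data, for illustration only.
sample_counts = {"the": 12, "and": 7, "whale": 5, "sea": 5, "ship": 3}
top_three = top_n_words(sample_counts, n=3)
print(top_three)  # [('the', 12), ('and', 7), ('sea', 5)]
```

Within an mrjob job, the same sort-and-slice logic could probably live in the second step's reducer, provided the preceding mapper sends every (count, word) pair to a single key so that one reducer sees them all. That placement is an assumption about how the job might be restructured, not something the answer spells out.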