Python 3 MRJob outputs unsorted key-value pairs

Time: 2018-05-09 01:44:08

Tags: python python-3.x mapreduce mrjob

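Context

Python 3.6.3 :: Anaconda custom (64-bit)
mrjob == 0.6.2, no custom configuration
Running locally

I am implementing the basic word count example as a local MapReduce job. My mapper uses a simple regular expression to map each word in every line of a book (a .txt file) to 1. The reducer counts the occurrences of each word, i.e. the number of 1s emitted for that word.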

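from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "they've".
WORD_REGEXP = re.compile(r"[\w']+")


class WordCounter(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word found on the input line.
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, word, times_seen):
        # times_seen iterates over the 1s emitted for this word.
        yield word, sum(times_seen)


if __name__ == '__main__':
    WordCounter.run()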
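
For readers without mrjob at hand, the mapper and reducer above can be mimicked in plain Python. This is only a sketch of the map, group-by-key, and reduce phases, not a description of how mrjob runs them internally; `map_line`, `reduce_word`, and `run_word_count` are hypothetical names introduced here for illustration.

```python
import re
from collections import defaultdict

WORD_REGEXP = re.compile(r"[\w']+")


def map_line(line):
    """Emit (word, 1) for each word on the line, like the mapper."""
    for word in WORD_REGEXP.findall(line):
        yield word.lower(), 1


def reduce_word(word, times_seen):
    """Sum the 1s collected for a single word, like the reducer."""
    return word, sum(times_seen)


def run_word_count(lines):
    """Group mapper output by key, then reduce each group in key order."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            grouped[word].append(one)
    # Sorting the keys here is what makes this sketch globally ordered.
    return [reduce_word(word, grouped[word]) for word in sorted(grouped)]


if __name__ == "__main__":
    for word, count in run_word_count(["The cat saw the dog", "the dog's cat"]):
        print(word, count)
```

Because every key passes through a single `sorted` call, the result here is globally ordered; a job that splits its keys across several reduce tasks has no such single point.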

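Problem

The output file is correct, but the key-value pairs are not globally sorted. The results appear to be in alphabetical order only within chunks of the data.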

"customers'"    1
"customizing"   1
"cut"   2
"cycle" 1
"cycles"    1
"d" 10
"dad"   1
"dada"  1
"daily" 3
"damage"    1
"deductible"    6
...
"exchange"  10
"excited"   4
"excitement"    1
"exciting"  4
"executive" 2
"executives"    2
"theft" 1
"their" 122
"them"  166
"theme" 2
"themselves"    16
"then"  59
"there" 144
"they've"   2
...
"anecdotes" 1
"angel" 1
"angie's"   1
"angry" 1
"announce"  2
"announced" 1
"announcement"  3
"announcements" 3
"announcing"    2
...
"patents"   3
"path"  19
"paths" 1
"patterns"  1
"pay"   45
"exercise"  1
"exercises" 1
"exist" 6
"expansion" 1
"expect"    11
"expectation"   3
"expectations"  5
"expected"  4
....
"customer"  41
"customers" 122
"yours" 15
"yourself"  78
"youth" 1
"zealand"   1
"zero"  7
"zoho"  1
"zone"  2
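
One way to confirm the pattern in the listing above is to look for places where a key sorts before the key preceding it. The following is a plain-Python check, not part of mrjob; `find_order_breaks` is a hypothetical helper name, and the sample keys are six consecutive keys copied from the listing.

```python
def find_order_breaks(keys):
    """Return (previous, current) pairs where current sorts before previous."""
    return [(prev, cur) for prev, cur in zip(keys, keys[1:]) if cur < prev]


if __name__ == "__main__":
    # Six consecutive keys from the output listing.
    sample = ["path", "paths", "patterns", "pay", "exercise", "exercises"]
    print(find_order_breaks(sample))
```

An empty result would mean the keys are globally sorted; each pair returned marks a boundary between two internally sorted chunks.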

Question

Is some initial configuration required to get globally sorted output from MRJob?
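
Independently of any job configuration, the finished output can be sorted after the fact. The sketch below assumes each output line is a JSON-encoded key, a tab, and a JSON-encoded value (the listing above looks like this format, though its separators may have been altered when the page was copied); `sort_output_lines` is a hypothetical helper, not an mrjob function.

```python
import json


def sort_output_lines(lines):
    """Sort 'key<TAB>value' lines by their decoded key."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        raw_key, raw_value = line.split("\t", 1)
        pairs.append((json.loads(raw_key), json.loads(raw_value)))
    pairs.sort(key=lambda pair: pair[0])
    return pairs


if __name__ == "__main__":
    unsorted_lines = ['"pay"\t45\n', '"exercise"\t1\n', '"customers\'"\t1\n', '"zone"\t2\n']
    for word, count in sort_output_lines(unsorted_lines):
        print(f"{json.dumps(word)}\t{count}")
```

This holds every pair in memory, which is fine for a single book but would not scale to output that is itself too large for one machine.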

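1 answer:

Answer 0 (score: 0)

You are missing the combiner step. In this guide it is the first example of a single-step job: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html

I will copy the code here so that this answer is complete: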

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
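
What the combiner contributes can be sketched in plain Python: it pre-sums counts within each chunk of mapper output before the reducer sees them. The sketch below only compares the final totals with and without that step and says nothing about output ordering. The chunking and the names `map_chunk`, `combine`, `reduce_all`, and `count_words` are hypothetical and do not reflect mrjob's internals.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def map_chunk(lines):
    """Mapper output for one chunk of input: a list of (word, 1) pairs."""
    return [(word.lower(), 1) for line in lines for word in WORD_RE.findall(line)]


def combine(pairs):
    """Pre-sum the counts within one chunk, like the combiner."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_all(pairs):
    """Sum the counts per word across all chunks, like the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def count_words(chunks, use_combiner):
    """Run map, optional combine, and reduce over a list of line chunks."""
    shuffled = []
    for chunk in chunks:
        pairs = map_chunk(chunk)
        shuffled.extend(combine(pairs) if use_combiner else pairs)
    return reduce_all(shuffled)


if __name__ == "__main__":
    chunks = [["pay the bill", "pay again"], ["the path to pay"]]
    print(count_words(chunks, use_combiner=True))
    print(count_words(chunks, use_combiner=False))
```

In this sketch the combiner reduces how many pairs reach the reducer while leaving the totals unchanged.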