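Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2, no custom configuration
Running locally
I am implementing the basic word count example as a local MapReduce job. My mapper uses a simple regular expression to split each line of a book (a .txt file) into words and maps every word to 1. The reducer counts the occurrences of each word, i.e. it sums the 1s emitted for that word.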
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")


class WordCounter(MRJob):
    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, word, times_seen):
        yield word, sum(times_seen)


if __name__ == '__main__':
    WordCounter.run()
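The counts in the output file are correct, but the key-value pairs are not globally sorted. The results appear to be in alphabetical order only within chunks of the data: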
"customers'" 1
"customizing" 1
"cut" 2
"cycle" 1
"cycles" 1
"d" 10
"dad" 1
"dada" 1
"daily" 3
"damage" 1
"deductible" 6
...
"exchange" 10
"excited" 4
"excitement" 1
"exciting" 4
"executive" 2
"executives" 2
"theft" 1
"their" 122
"them" 166
"theme" 2
"themselves" 16
"then" 59
"there" 144
"they've" 2
...
"anecdotes" 1
"angel" 1
"angie's" 1
"angry" 1
"announce" 2
"announced" 1
"announcement" 3
"announcements" 3
"announcing" 2
...
"patents" 3
"path" 19
"paths" 1
"patterns" 1
"pay" 45
"exercise" 1
"exercises" 1
"exist" 6
"expansion" 1
"expect" 11
"expectation" 3
"expectations" 5
"expected" 4
....
"customer" 41
"customers" 122
"yours" 15
"yourself" 78
"youth" 1
"zealand" 1
"zero" 7
"zoho" 1
"zone" 2
Is there some initial configuration I need in order to get globally sorted output from MRJob?
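Independently of any MRJob setting, chunk-wise output like the listing above can be sorted globally after the job has finished. The following is a minimal sketch of that post-processing step, assuming the output uses the job's default format (one JSON-encoded key and one JSON-encoded value per line, separated by a tab). The function name `sort_output_lines` and the sample lines are hypothetical and only for illustration.

```python
import json


def sort_output_lines(lines):
    """Parse 'key<TAB>value' lines and return (key, value) pairs sorted by key.

    Assumes each line holds a JSON-encoded key and value separated by a tab.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key_json, value_json = line.split("\t", 1)
        pairs.append((json.loads(key_json), json.loads(value_json)))
    # Sort on the decoded key only, so the ordering is global across all chunks.
    return sorted(pairs, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical sample: lines taken from different chunks of the output.
    sample = ['"theft"\t1', '"cut"\t2', '"angel"\t1', '"zone"\t2']
    for word, count in sort_output_lines(sample):
        print(json.dumps(word), count, sep="\t")
```

This holds the whole result in memory, which is fine for a word count over a single book but would not scale to very large outputs.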
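Answer 0 (score: 0)
You are missing the combiner step. It appears in the first example of a single-step job in this guide: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
I am copying the code here so that this answer is complete: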
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
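To make the role of the combiner concrete: it pre-aggregates the (word, 1) pairs produced by one mapper before they reach the reducer, so the reducer sums partial counts instead of individual 1s. The following pure-Python sketch imitates that flow without mrjob. It is not mrjob's implementation, and `map_and_combine`, `reduce_counts` and the two input splits are hypothetical names and data chosen for illustration.

```python
from collections import Counter, defaultdict
import re

WORD_RE = re.compile(r"[\w']+")


def map_and_combine(lines):
    """Map one input split to words and combine them into partial counts."""
    partial = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            partial[word.lower()] += 1  # same effect as summing the emitted 1s
    return list(partial.items())


def reduce_counts(pairs):
    """Sum the partial counts per word, as the reducer does."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


if __name__ == "__main__":
    # Two hypothetical input splits, each handled by its own mapper.
    split_a = ["The cat sat", "the dog"]
    split_b = ["The end"]
    combined = map_and_combine(split_a) + map_and_combine(split_b)
    print(reduce_counts(combined))
```

The totals are the same with or without the combining step; what changes is how many intermediate pairs are passed from the mappers to the reducer.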