Declaring an mrjob mapper without ignoring the key

Time: 2015-11-16 22:38:54

Tags: python hadoop mapreduce mrjob

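I want to declare a mapper function with mrjob. Because my mapper needs to reference some constants for its calculations, I decided to put these constants into the mapper's key (is there another way?). I read the mrjob tutorial on this site, but all the examples ignore the key. For example: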

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

Basically, I would like something like the following:

def mapper(self, (constant1,constant2,constant3,constant4,constant5), line):
    My calculation goes here

Please advise how I can do this. Thanks.

1 answer:

Answer 0: (score: 2)

You can set the constants in __init__:

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, key, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
        yield "Constant",self.constant

    def reducer(self, key, values):
        yield key, sum(values)

    def __init__(self,*args,**kwargs):
        super(MRWordFrequencyCount, self).__init__(*args, **kwargs)
        self.constant = 10


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Output:

"Constant"  10
"chars" 12
"lines" 1
"words" 2
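
Note that the mapper above emits "Constant" once per input line and the reducer sums all values for a key, so the 10 shown here corresponds to a single-line input. A minimal plain-Python sketch of that map/group/reduce flow shows what would happen with two lines. It does not use mrjob; `run_local`, `CONSTANT`, and the sample lines are hypothetical names and data made up for illustration:

```python
from collections import defaultdict

CONSTANT = 10  # stands in for self.constant set in __init__


def mapper(line):
    # Same pairs as the mapper in the answer, emitted once per input line.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1
    yield "Constant", CONSTANT


def reducer(key, values):
    # Same reducer as in the answer: sum every value seen for the key.
    yield key, sum(values)


def run_local(lines):
    # Group mapper output by key, then reduce each group in key order.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


one_line = run_local(["hello world!"])
two_lines = run_local(["hello world!", "hello again!"])
print(one_line)
print(two_lines)
```

If the constant is only needed inside the calculation, it is probably simpler to read self.constant in the mapper and not yield it at all.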

Alternatively, you can use RawProtocol:

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol


class MRWordFrequencyCount(MRJob):
    # Read each input line as a tab-separated (key, value) pair
    INPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        yield "constant", key
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        if str(key) != "constant":
            yield key, sum(values)
        else:
            yield "constant",list(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

If the input is (RawProtocol expects the key and the line to be separated by a tab):

constant1,constant2,constant3   The quick brown fox jumps over the lazy dog

Output:

"chars" 43
"constant"  ["constant1,constant2,constant3"]
"lines" 1
"words" 9
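
To get back to the individual constants the question asks for, the key can be split inside the mapper. A minimal sketch in plain Python, assuming the key and the line are separated by a single tab and the constants by commas, as in the input above. `parse_record` is a hypothetical helper, not part of mrjob:

```python
def parse_record(raw_line):
    # Split on the first tab only: left part is the key, right part the line.
    key, _, line = raw_line.partition("\t")
    # The key carries the comma-separated constants.
    constants = key.split(",")
    return constants, line


constants, line = parse_record(
    "constant1,constant2,constant3\tThe quick brown fox jumps over the lazy dog"
)
constant1, constant2, constant3 = constants
print(constants, len(line), len(line.split()))
```

Inside the RawProtocol mapper, the same key.split(",") call on the key argument should give the constants for that record.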