Question

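I have a small problem with this code. The first step works fine and sums the prices for each customer. The second step is supposed to sort the totals by price, but the output in my console is still not sorted. Why?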

from mrjob.job import MRJob
from mrjob.step import MRStep


class CustomerCount(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_initial,
                   reducer=self.reducer_initial),
            MRStep(mapper=self.mapper_sort,
                   reducer=self.reducer_sort)
        ]

    def mapper_initial(self, _, line):
        (customerID, price) = line.split(',')[0:3:2]
        yield customerID, float(price)

    def reducer_initial(self, customerID, prices):
        yield customerID, sum(prices)

    def mapper_sort(self, customerID, price):
        yield '%04.02f' % float(price), customerID

    def reducer_sort(self, price, customersID):
        for val in customersID:
            yield val, price


if __name__ == '__main__':
    CustomerCount.run()

The data lines look like this (I am interested in the first and third fields, the customer ID and the price):

44,8602,37.19
35,5368,65.89

Answer 1

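The mapper in your second MR step emits the price as the key. Records are therefore grouped by that key, so all customer IDs with the same price arrive in a single reducer call, but nothing orders the data by customer ID. To check this hypothesis, try running the job with several customers that have the same price. To get the output you want, you can emit a constant key from the mapper (for example the constant string "1") and write a custom comparator that sorts on price and customer ID.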
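The constant-key idea can be sketched in plain Python as below. This is a minimal illustration, not the answerer's code: `run_sort_step` is a hypothetical helper that simulates the second step locally, and the two generator functions only mirror the shape of `mapper_sort` and `reducer_sort` from the question (without `self`). It assumes all totals fit in memory in one reducer call. The last lines also show a detail worth checking in the original job: `'%04.02f'` pads to a width of only 4 characters, so the string keys compare lexicographically rather than numerically.

```python
def mapper_sort(customer_id, total):
    # Constant key (None): every record is sent to the same reducer call.
    yield None, (float(total), customer_id)


def reducer_sort(_, pairs):
    # Sort numerically on the total, then on customer ID to break ties.
    for total, customer_id in sorted(pairs):
        yield customer_id, total


def run_sort_step(totals):
    """Hypothetical local simulation of the second step.

    `totals` maps customer ID -> summed price (the first step's output).
    """
    pairs = [value
             for customer_id, total in totals.items()
             for _key, value in mapper_sort(customer_id, total)]
    return list(reducer_sort(None, pairs))


# Width 4 is too narrow to pad these values, so string order is not numeric order.
narrow_keys = ['%04.02f' % p for p in (99.0, 1000.0)]
# A wider zero-padded field (width 10 is an arbitrary choice) sorts correctly as text.
wide_keys = ['%010.2f' % p for p in (99.0, 1000.0)]
```

Inside an actual `MRJob` class the same two functions would take `self` as their first parameter. Funnelling everything through one key gives up parallelism in the sort step, so this only suits outputs small enough for a single reducer.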
