I am trying to run a simple MapReduce job on the following datasets:
bike.txt
1 Bike 1
2 Bike 2
3 Bike 4
4 Bike 4
5 Bike 4
bikenames.txt
1,Aap
2,Noot
3,Greet
4,Mies
5,Gazelle
My goal is to write a MapReduce job that outputs the most frequently occurring bike name. So I wrote the following:
from mrjob.job import MRJob
from mrjob.step import MRStep


class MostPopularBike(MRJob):

    def configure_options(self):
        super(MostPopularBike, self).configure_options()
        self.add_file_option('--items', help='Path to u.item')

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
            MRStep(mapper=self.mapper_passthrough,
                   reducer=self.reducer_find_max)
        ]

    def mapper_get_ratings(self, _, line):
        # Emit one count per input line, keyed by bike ID.
        (bikeID, name) = line.split('\t')
        yield bikeID, 1

    def reducer_init(self):
        # Build the bike ID -> name lookup table.
        self.bikeNames = {}
        with open("bikenames.txt") as f:
            for line in f:
                fields = line.split(',')
                self.bikeNames[fields[0]] = fields[1]

    def reducer_count_ratings(self, key, values):
        # Send every (count, name) pair to a single key for the next step.
        yield None, (sum(values), self.bikeNames[key])

    def mapper_passthrough(self, key, value):
        yield key, value

    def reducer_find_max(self, key, values):
        yield max(values)


if __name__ == '__main__':
    MostPopularBike.run()
When I try to run it with:
!python MostPopularBike.py --items=bikenames.txt bike.txt
it produces the following error:
AttributeError: 'MostPopularBike' object has no attribute 'bikeNames'
Any ideas about what is going wrong here?
Answer 0 (score: 1)
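bikeNames is only defined in reducer_init(), so that function must not be getting called: steps() is overridden and never passes reducer_init to an MRStep, so mrjob has no reason to run it. In any case, it does not look like a per-step initialization function; it looks more like initialization for the whole job.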
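The failure can be reproduced without mrjob. The sketch below is plain Python, not the library's actual runner: FakeJob and run_step are hypothetical names, and run_step only imitates the idea that an init hook runs when it has been registered for the step.

```python
class FakeJob:
    """Hypothetical stand-in for an MRJob subclass (not mrjob itself)."""

    def reducer_init(self):
        # The lookup table exists only after this method has run.
        self.bikeNames = {"4": "Mies"}

    def reducer_count_ratings(self, key, values):
        return sum(values), self.bikeNames[key]


def run_step(job, key, values, reducer_init=None):
    """Call the init hook only if one was registered, then the reducer."""
    if reducer_init is not None:
        reducer_init()
    return job.reducer_count_ratings(key, values)


# No hook registered: the reducer reads an attribute that was never set.
try:
    run_step(FakeJob(), "4", [1, 1, 1])
except AttributeError as exc:
    print("without hook:", exc)

# Hook registered: the lookup table is built before the reducer runs.
job = FakeJob()
print("with hook:", run_step(job, "4", [1, 1, 1], reducer_init=job.reducer_init))
```

The first call raises the same kind of AttributeError as in the question; the second succeeds because the hook ran first.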
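Either rename the function from reducer_init to __init__, so that the initialization runs when the MostPopularBike instance is created (in that case __init__ should also accept the arguments that MRJob's constructor takes and call the parent constructor), or, if you really do want the initialization to run in each step, update steps to: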
def steps(self):
    return [
        MRStep(reducer_init=self.reducer_init,
               mapper=self.mapper_get_ratings,
               reducer=self.reducer_count_ratings),
        MRStep(reducer_init=self.reducer_init,
               mapper=self.mapper_passthrough,
               reducer=self.reducer_find_max)
    ]
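To sanity-check what the fixed job should report for the sample data, the two steps can be imitated in plain Python. This is only a sketch under an assumption that the question does not confirm: that the last whitespace-separated field of each bike.txt line is the bike ID. The names most_popular and load_names are made up for illustration.

```python
from collections import Counter

# Sample data copied from the question. Assumption: the last
# whitespace-separated field of each bike.txt line is the bike ID.
BIKE_LINES = ["1 Bike 1", "2 Bike 2", "3 Bike 4", "4 Bike 4", "5 Bike 4"]
NAME_LINES = ["1,Aap", "2,Noot", "3,Greet", "4,Mies", "5,Gazelle"]


def load_names(lines):
    """Build the bike ID -> name lookup, as reducer_init does."""
    names = {}
    for line in lines:
        bike_id, name = line.strip().split(",")
        names[bike_id] = name
    return names


def most_popular(bike_lines, name_lines):
    """Count rows per bike ID (step 1), then keep the maximum (step 2)."""
    names = load_names(name_lines)
    counts = Counter(line.split()[-1] for line in bike_lines)
    return max((count, names[bike_id]) for bike_id, count in counts.items())


print(most_popular(BIKE_LINES, NAME_LINES))  # prints (3, 'Mies')
```

Note that load_names strips the trailing newline before splitting; the question's reducer_init does not, so the names it stores would keep a newline at the end.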