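I'm using MRjob to run Hadoop Streaming jobs against an HBase instance. For the life of me, I can't figure out how to pass parameters to my reducer. I have two parameters, startDate and endDate, that I want to pass to the reducer when I run the job. Here is what my reducer currently looks like: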
def reducer(self, groupId, meterList):
    """
    Print bucket.
    """
    sys.stderr.write("Working on group = " + str(groupId) + "\n")
    #print "Opening connection..."
    conn = open_connection(hostname)
    #print "Getting table..."
    table = get_table(conn, tableName)
    compositeDf = DataFrame()
    for meterId in meterList:
        sys.stderr.write("Querying: " + str(meterId) + "\n")
        df = extract_meter_data(table, meterId, startDate, endDate)
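I can't seem to pass startDate and endDate to my reducer as parameters. The only way I've been able to get them in is through global variables defined above the class: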
startDate = datetime.datetime(2012, 6, 10)
endDate = datetime.datetime(2012, 6, 11)
class MRDataQuality(MRJob):
    """
    MapReduce job that does a data quality check on the meter data in HBase.
    """
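But that's dirty. I'd like to pass them in when I invoke the job. I've tried many approaches: setting them as instance variables, setting them as static class variables, creating an overloaded constructor for MRDataQualityJob... Nothing seems to work. I invoke the job programmatically from my top-level script: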
if args.hadoop:
    mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf',
                                   '--jobconf', 'mapred.reduce.tasks=42', meterFile])
else:
    mrdq_job = MRDataQuality(args=[meterFile])
with mrdq_job.make_runner() as runner:
    runner.run()
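No matter what I do to the mrdq_job instance, runner.run() seems to use a fresh instance of the class that has none of the instance or static variables defined. How can I pass my arguments to the reducer? In regular Hadoop Streaming I can do it by passing a string: "-reducer reducer.py arg1 arg2". Is there an equivalent in MRjob?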
Answer 0 (score: 3)
How about passing the parameters into the job configuration and then reading them with get_jobconf_value?
Something like this:
from mrjob.compat import get_jobconf_value

class MRDataQuality(MRJob):
    def reducer(self, groupId, meterList):
        ...
        # Read the values passed with --jobconf (they arrive as strings)
        startDate = get_jobconf_value("my.job.settings.startdate")
        endDate = get_jobconf_value("my.job.settings.enddate")
        for meterId in meterList:
            sys.stderr.write("Querying: " + str(meterId) + "\n")
            df = extract_meter_data(table, meterId, startDate, endDate)
Then set the parameters in code, the same way as above:
mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf',
                               '--jobconf', 'mapred.reduce.tasks=42',
                               '--jobconf', 'my.job.settings.startdate=2013-06-10',
                               '--jobconf', 'my.job.settings.enddate=2013-06-11',
                               meterFile])
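Since the --jobconf values have to be plain strings, dates that start out as datetime objects in the driver script need to be formatted before they go into the args list. A minimal sketch of that step, assuming a hypothetical helper named build_job_args (it is not part of mrjob) and assuming the date format shown in the example values above:

```python
import datetime

# Assumed format; it has to match whatever the reducer uses to parse the value.
DATE_FORMAT = "%Y-%m-%d"


def build_job_args(meter_file, start_date, end_date, hadoop=False):
    """Build the args list for MRDataQuality (hypothetical helper, not part of mrjob)."""
    args = []
    if hadoop:
        # Same runner options as in the question's driver script.
        args += ['-r', 'hadoop', '--conf-path', 'mrjob.conf',
                 '--jobconf', 'mapred.reduce.tasks=42']
    # Each setting is passed as its own --jobconf key=value pair.
    args += ['--jobconf',
             'my.job.settings.startdate=' + start_date.strftime(DATE_FORMAT),
             '--jobconf',
             'my.job.settings.enddate=' + end_date.strftime(DATE_FORMAT)]
    # The input file goes last, as in the original call.
    args.append(meter_file)
    return args
```

The driver could then create the job with something like MRDataQuality(args=build_job_args(meterFile, startDate, endDate, hadoop=args.hadoop)), which keeps the two branches of the if/else from drifting apart.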
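Answer 1 (score: 1)
How about passing the parameters into the job configuration and then reading them with get_jobconf_value inside reducer_init? That way the parameters are read only once per reducer task instead of once per key.
Something like this: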
from mrjob.compat import get_jobconf_value

class MRDataQuality(MRJob):
    def reducer_init(self):
        ...
        self.startDate = get_jobconf_value("my.job.settings.startdate")
        self.endDate = get_jobconf_value("my.job.settings.enddate")

    def reducer(self, groupId, meterList):
        for meterId in meterList:
            sys.stderr.write("Querying: " + str(meterId) + "\n")
            df = extract_meter_data(table, meterId, self.startDate, self.endDate)
Then set the parameters in code, the same way as above:
mrdq_job = MRDataQuality(args=['-r', 'hadoop', '--conf-path', 'mrjob.conf',
                               '--jobconf', 'mapred.reduce.tasks=42',
                               '--jobconf', 'my.job.settings.startdate=2013-06-10',
                               '--jobconf', 'my.job.settings.enddate=2013-06-11',
                               meterFile])
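One thing to watch: values read from the job configuration are strings, whereas the globals in the question were datetime objects. If extract_meter_data needs datetime objects (an assumption, since its signature isn't shown), the strings can be converted once in reducer_init. A minimal sketch, where parse_job_date is a hypothetical helper and the date format is assumed from the example values above:

```python
import datetime


def parse_job_date(value, fmt="%Y-%m-%d"):
    """Convert a jobconf string such as '2013-06-10' into a datetime.

    Hypothetical helper; the format is an assumption based on the
    example --jobconf values, not something mrjob prescribes.
    """
    if value is None:
        # get_jobconf_value returns its default (None) when the key is unset.
        raise ValueError("missing date setting; was --jobconf passed?")
    return datetime.datetime.strptime(value, fmt)
```

In reducer_init this would be used as self.startDate = parse_job_date(get_jobconf_value("my.job.settings.startdate")), and likewise for endDate, so a missing setting fails early rather than deep inside the HBase query.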