mrjob add_file_arg() with a CSV file

Time: 2018-05-18 23:27:07

Tags: python python-3.x mrjob

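I can't figure out how to use add_file_arg() with mrjob. I'm trying to pass a CSV of people's attributes to my mapper and look up each person's attributes inside the mapper. Here is my code so far: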

class MRPeopleScores(MRJob):
    def configure_args(self):
        super(MRPeopleScores, self).configure_args()
        self.add_file_arg('--database')

    def mapper(self, _, line):
        print(self.options.database)

When I run

python3 calculate_people_scores.py --jobconf mapreduce.job.reduces=1 data/people_ids.csv database=data/people_attributes.csv

I get the following error message:

Traceback (most recent call last):
File "calculate_people_scores.py", line 88, in <module>
MRPeopleScores.run()
File "/usr/local/lib/python3.6/site-packages/mrjob/job.py", line 439, in run
mr_job.execute()
File "/usr/local/lib/python3.6/site-packages/mrjob/job.py", line 460, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python3.6/site-packages/mrjob/launch.py", line 161, in execute
self.run_job()
File "/usr/local/lib/python3.6/site-packages/mrjob/launch.py", line 231, in run_job
runner.run()
File "/usr/local/lib/python3.6/site-packages/mrjob/runner.py", line 476, in run
self._run()
File "/usr/local/lib/python3.6/site-packages/mrjob/sim.py", line 185, in _run
self._invoke_step(step_num, 'mapper')
File "/usr/local/lib/python3.6/site-packages/mrjob/sim.py", line 272, in _invoke_step
working_dir, env)
File "/usr/local/lib/python3.6/site-packages/mrjob/inline.py", line 154, in _run_step
child_instance.execute()
File "/usr/local/lib/python3.6/site-packages/mrjob/job.py", line 448, in execute
self.run_mapper(self.options.step_num)
File "/usr/local/lib/python3.6/site-packages/mrjob/job.py", line 526, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "calculate_people_scores.py", line 47, in mapper
print(self.options.database)
AttributeError: 'Values' object has no attribute 'database'

I'm sure I'm badly misunderstanding how to use this argument, and any help would be much appreciated.

1 Answer:

Answer 0: (score: 0)

I think you should run

python3 calculate_people_scores.py --jobconf mapreduce.job.reduces=1 data/people_ids.csv --database data/people_attributes.csv

instead of passing database=data/people_attributes.csv.
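To see why the leading dashes matter, here is a minimal sketch using plain argparse as a stand-in. The parser below is hypothetical and is not mrjob's own, so treat it as an analogy only: a token without `--` is read as one more positional argument (for a job, that would be another input path), and the named option stays unset.

```python
import argparse

# Hypothetical stand-in parser, NOT mrjob's real one: positional input
# paths plus a named --database option.
parser = argparse.ArgumentParser()
parser.add_argument("paths", nargs="*")
parser.add_argument("--database")

# Without the dashes, the token is just another positional argument.
wrong = parser.parse_args(["data/people_ids.csv",
                           "database=data/people_attributes.csv"])

# With the dashes, the value is bound to the option.
right = parser.parse_args(["data/people_ids.csv",
                           "--database", "data/people_attributes.csv"])

print(wrong.paths, wrong.database)
print(right.paths, right.database)
```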

Passing --database data/people_attributes.csv only hands the job the path to the file, so you need to open the file before you can do anything with it. You can do that by overriding the mapper_init() or reducer_init() function, like this:

def mapper_init(self):
    self.db=open(self.options.database)

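def reducer_init(self):
    self.db = open(self.options.database)

Now you can use self.db in your mapper or reducer. Prefer yield over print for mapper or reducer output. On the other hand, printing (or yielding) the whole second file (self.db) from a mapper or reducer is not recommended, because that code runs once for every mapper or reducer call. It is possible, though.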

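Finally, you can access the contents of your CSV like this:

def mapper(self, _, line):
    # Note: readlines() consumes the file handle, so only the first call
    # returns the rows; later calls return an empty list.
    print(self.db.readlines())
    # OR
    yield (None, self.db.readlines())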
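Because a file handle is exhausted after the first readlines() call, one alternative is to read the CSV into a dictionary once and look rows up for each input line. The sketch below is plain Python with no mrjob dependency. load_attributes and lookup_person are hypothetical helper names, and the column layout (an id column followed by attribute columns) is an assumption, since the post does not show the file.

```python
import csv
import os
import tempfile


def load_attributes(path):
    """Read the attributes CSV once and index its rows by the first column."""
    with open(path, newline="") as handle:
        return {row[0]: row[1:] for row in csv.reader(handle) if row}


def lookup_person(line, attributes):
    """Yield (person_id, attributes) for one input line, as a mapper would."""
    person_id = line.strip()
    if person_id in attributes:
        yield person_id, attributes[person_id]


# Tiny demonstration with made-up data (assumed layout: id,attr1,attr2).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as tmp:
    tmp.write("1,tall,blue\n2,short,green\n")

db = load_attributes(tmp.name)
os.remove(tmp.name)

# Ids "1" and "2" are in the made-up file; "3" is not and yields nothing.
results = [pair for line in ["1\n", "2\n", "3\n"]
           for pair in lookup_person(line, db)]
print(results)
```

Inside the job, mapper_init() could set self.db = load_attributes(self.options.database), and mapper() could then yield from lookup_person(line, self.db), so the file is read once per task rather than once per line.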