I'm working on a project using Hadoop MapReduce. My project tree is shown below:
MyProject
├── parse_xml_file.py
├── store_xml_directory
│   └── my_xml_file.xml
├── requirements.txt
├── input_to_hadoop.txt
└── testMrjob.py
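For context, testMrjob.py is roughly a standard mrjob word-frequency job that imports the helper module (a simplified sketch reconstructed from the traceback below, not my exact code; the mapper body here is only illustrative):

# testMrjob.py (simplified sketch)
from mrjob.job import MRJob

import parse_xml_file  # helper module that does the XML parsing with lxml


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # the real mapper hands each line to parse_xml_file;
        # counting words here is just a placeholder
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFrequencyCount.run()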
When I run it locally with this command, it works without errors:
python testMrjob.py < input_to_hadoop.txt > output
But when I run it on Hadoop with the following command (the lxml library is installed on all nodes):
python testMrjob.py -r hadoop --file parse_xml_file.py < input_to_hadoop.txt
or
python testMrjob.py -r hadoop --file parse_xml_file.py --file store_xml_directory/my_xml_file.xml < input_to_hadoop.txt > output
I get this error:
no configs found; falling back on auto-configuration
creating tmp directory /tmp/testMrjob.haduser.20141018.152349.482573
Uploading input to hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input
reading from STDIN
Copying non-input files into hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/
Using Hadoop version 1.2.1
HADOOP: Loaded the native-hadoop library
HADOOP: Snappy native library not loaded
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/opt/hadoop/dfs/mapred/local]
HADOOP: Running job: job_201410182107_0012
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201410182107_0012_m_000000
HADOOP: killJob...
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [/opt/hadoop/tmp/hadoop-unjar9122722052766576889/] [] /tmp/streamjob2542718124608434574.jar tmpDir=null
Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
File "testMrjob.py", line 25, in <module>
MRWordFrequencyCount.run()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 516, in run
mr_job.execute()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 532, in execute
self.run_job()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 602, in run_job
runner.run()
File "/usr/lib/python2.7/dist-packages/mrjob/runner.py", line 516, in run
self._run()
File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 239, in _run
self._run_job_in_hadoop()
File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 442, in _run_job_in_hadoop
raise Exception(msg)
Exception: Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']
Answer (score: 0):
To distribute Python modules with mrjob, you should use --python-archive rather than --file.
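For example (an untested sketch based on the project layout above; the archive name parse_xml_file.tar.gz is just illustrative), package the helper module into a tarball and ship it with the job:

tar -czf parse_xml_file.tar.gz parse_xml_file.py
python testMrjob.py -r hadoop --python-archive parse_xml_file.tar.gz --file store_xml_directory/my_xml_file.xml < input_to_hadoop.txt > output

As I understand it, mrjob unpacks a --python-archive on each task node and adds it to PYTHONPATH, so import parse_xml_file should resolve inside the mappers, whereas --file only places the file in the task's working directory.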