mrjob fails when running on Hadoop with the lxml library

Time: 2014-10-18 15:25:32

Tags: lxml hadoop-streaming mrjob

I am working on a project that uses Hadoop MapReduce. My project tree looks like this:

MyProject
├── parse_xml_file.py
├── store_xml_directory
│   └── my_xml_file.xml
├── requirements.txt
├── input_to_hadoop.txt
└── testMrjob.py

When I run the job locally with the following command, it completes without errors:

python testMrjob.py < input_to_hadoop.txt > output

But when I run it on Hadoop with the following command (the lxml library is installed on all nodes):

python testMrjob.py -r hadoop --file parse_xml_file.py < input_to_hadoop.txt

or

python testMrjob.py -r hadoop --file parse_xml_file.py --file store_xml_directory/my_xml_file.xml < input_to_hadoop.txt > output

I get this error:

no configs found; falling back on auto-configuration
creating tmp directory /tmp/testMrjob.haduser.20141018.152349.482573
Uploading input to hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input
reading from STDIN
Copying non-input files into hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/
Using Hadoop version 1.2.1
HADOOP: Loaded the native-hadoop library
HADOOP: Snappy native library not loaded
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/opt/hadoop/dfs/mapred/local]
HADOOP: Running job: job_201410182107_0012
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201410182107_0012_m_000000
HADOOP: killJob...
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [/opt/hadoop/tmp/hadoop-unjar9122722052766576889/] [] /tmp/streamjob2542718124608434574.jar tmpDir=null
Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
  File "testMrjob.py", line 25, in <module>
    MRWordFrequencyCount.run()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 516, in run
    mr_job.execute()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 532, in execute
    self.run_job()
  File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 602, in run_job
    runner.run()
  File "/usr/lib/python2.7/dist-packages/mrjob/runner.py", line 516, in run
    self._run()
  File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 442, in _run_job_in_hadoop
    raise Exception(msg)
Exception: Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']

1 Answer:

Answer 0 (score: 0)

To distribute Python modules with mrjob, you should use --python-archive instead of --file.
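
As a rough illustration of that suggestion, here is a minimal sketch of how the helper module could be packed into a tarball before being handed to --python-archive. The function names build_python_archive and list_archive and the archive name parse_xml_file.tar.gz are my own placeholders, not something from the question, and the sketch assumes an mrjob version that still accepts --python-archive (newer releases may handle this differently). The demo at the bottom uses a stand-in file in a temporary directory so it does not touch the real project.

```python
import os
import tarfile
import tempfile


def build_python_archive(module_paths, archive_path):
    """Pack the given .py files at the top level of a gzip-compressed tarball.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in module_paths:
            # arcname drops the directory part, so each module sits at the
            # archive root and can be imported once that root is on PYTHONPATH.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


def list_archive(archive_path):
    """Return the sorted member names of a gzip-compressed tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    # Stand-in for the real parse_xml_file.py, written to a temp directory.
    work_dir = tempfile.mkdtemp()
    module_path = os.path.join(work_dir, "parse_xml_file.py")
    with open(module_path, "w") as handle:
        handle.write("def parse(path):\n    return path\n")

    archive = build_python_archive(
        [module_path], os.path.join(work_dir, "parse_xml_file.tar.gz")
    )
    print(list_archive(archive))
```

With an archive built this way in the project directory, the Hadoop invocation from the question would presumably become something like: python testMrjob.py -r hadoop --python-archive parse_xml_file.tar.gz < input_to_hadoop.txt > output. This only covers shipping the Python module; the XML data file is a separate matter and may still need to be passed with --file.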