My Python script for the mapper looks like this:
import sys,re,string
from os import listdir
from os.path import isfile, join
dir_path = "user/lexi/data/"
sys.path.append('.')
for filename in sys.stdin:
    file_path = dir_path + filename.strip()
    rfile = open(file_path, 'r')
    #cat = subprocess.Popen(["hadoop","fs","-cat",file_path],stdout=subprocess.PIPE)
    for line in rfile.readlines():
        # do something with each line
        pass
The input on sys.stdin is a filenames.txt file that contains a list of file names like this:
123.html
124.html
125.html
...
All of these HTML files are under "user/lexi/data/" on HDFS.
Then I run it with Hadoop Streaming:
hadoop jar /hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input 'input/filenames.txt' \
    -output 'test-output' \
    -file Mapper.py -file Reducer.py \
    -mapper 'Mapper.py' -reducer 'Reducer.py'
But I get this error:
14/02/14 20:47:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/14 20:47:54 WARN snappy.LoadSnappy: Snappy native library not loaded
14/02/14 20:47:54 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/14 20:47:54 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/mapred]
14/02/14 20:47:54 INFO streaming.StreamJob: Running job: job_201401161224_3630
14/02/14 20:47:54 INFO streaming.StreamJob: To kill this job, run:
14/02/14 20:47:54 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job -Dmapred.job.tracker=head:9000 -kill job_201401161224_3630
14/02/14 20:47:54 INFO streaming.StreamJob: Tracking URL: http://mri-head.mri.cs.gmu.edu:50030/jobdetails.jsp?jobid=job_201401161224_3630
14/02/14 20:47:55 INFO streaming.StreamJob: map 0% reduce 0%
14/02/14 20:48:26 INFO streaming.StreamJob: map 100% reduce 100%
14/02/14 20:48:26 INFO streaming.StreamJob: To kill this job, run:
14/02/14 20:48:26 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job -Dmapred.job.tracker=head:9000 -kill job_201401161224_3630
14/02/14 20:48:26 INFO streaming.StreamJob: Tracking URL: http://mri-head.mri.cs.gmu.edu:50030/jobdetails.jsp?jobid=job_201401161224_3630
14/02/14 20:48:26 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201401161224_3630_m_000001
14/02/14 20:48:26 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Can anyone help me figure this out? I really need help!
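One likely culprit: Python's built-in open() only sees the local filesystem of the task node, so a path such as user/lexi/data/123.html that exists only on HDFS would raise an IOError and fail the map task. The commented-out `hadoop fs -cat` line in the mapper points at one way around this. Below is a minimal sketch of that approach, assuming the `hadoop` command is on the PATH of every task node. The helper names hdfs_cat_command and read_lines are hypothetical and not part of any Hadoop API.

```python
import subprocess
import sys

# HDFS directory from the question (relative to the user's HDFS home).
DIR_PATH = "user/lexi/data/"


def hdfs_cat_command(filename, dir_path=DIR_PATH):
    """Build the `hadoop fs -cat` command for one file name read from stdin."""
    return ["hadoop", "fs", "-cat", dir_path + filename.strip()]


def read_lines(command):
    """Run a command and yield its stdout line by line as text."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            universal_newlines=True)
    for line in proc.stdout:
        yield line.rstrip("\n")
    proc.stdout.close()
    # Surface a failed `hadoop fs -cat` instead of silently emitting nothing.
    if proc.wait() != 0:
        raise RuntimeError("command failed: " + " ".join(command))


def main():
    """Mapper loop: each stdin line is a file name listed in filenames.txt."""
    for filename in sys.stdin:
        if not filename.strip():
            continue  # skip blank lines
        for line in read_lines(hdfs_cat_command(filename)):
            pass  # do something with each line
```

To use this as Mapper.py, call main() at the bottom of the file. Since the job is launched with -mapper 'Mapper.py', the script probably also needs a shebang line (for example #!/usr/bin/env python) as its first line and the executable bit set. The stderr log of the failed task in the JobTracker web UI should show the actual Python traceback.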