How do I read files from HDFS inside a Python script when using Hadoop Streaming?

Asked: 2014-02-15 02:09:20

Tags: hdfs

My Python script for the mapper looks like this:

import sys, re, string
from os import listdir
from os.path import isfile, join

dir_path = "user/lexi/data/"
sys.path.append('.')
for filename in sys.stdin:
    file_path = dir_path + filename.strip()
    rfile = open(file_path, 'r')
    #cat = subprocess.Popen(["hadoop","fs","-cat",file_path],stdout=subprocess.PIPE)
    for line in rfile.readlines():
        pass  # do something with each line
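The commented-out `subprocess.Popen` line above hints at the usual workaround: a streaming mapper runs on a worker node whose local filesystem does not contain the HDFS files, so `open()` fails, but shelling out to `hadoop fs -cat` can stream each file's contents instead. A minimal sketch of the mapper along those lines follows; it assumes the `hadoop` binary is on the PATH of every task node, and the helper names (`hdfs_cat_cmd`, `process_line`) are hypothetical, not part of any Hadoop API:

```python
import subprocess
import sys

DIR_PATH = "user/lexi/data/"

def hdfs_cat_cmd(file_path):
    # Build the CLI command that streams one file's contents from HDFS.
    return ["hadoop", "fs", "-cat", file_path]

def process_line(line):
    # Hypothetical placeholder for the real per-line work ("do something").
    sys.stdout.write(line)

def main():
    for filename in sys.stdin:
        file_path = DIR_PATH + filename.strip()
        # Stream the file from HDFS via the hadoop CLI instead of open(),
        # since the mapper's local filesystem does not hold these files.
        cat = subprocess.Popen(hdfs_cat_cmd(file_path),
                               stdout=subprocess.PIPE)
        for line in cat.stdout:
            process_line(line)
        cat.wait()

if __name__ == "__main__":
    main()
```

Note that spawning one `hadoop fs -cat` process per input file adds JVM startup overhead for every file, so this sketch trades simplicity for speed.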

As for sys.stdin, it is a filenames.txt file containing a list of file names like this:

123.html
124.html
125.html
...

All of these HTML files are under "user/lexi/data/" on HDFS.

Then I run it with Hadoop Streaming:

hadoop jar /hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar\
 -input 'input/filenames.txt'\
 -output 'test-output'\
 -file Mapper.py -file Reducer.py\
 -mapper 'Mapper.py' -reducer 'Reducer.py'

But I got this error:

14/02/14 20:47:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/02/14 20:47:54 WARN snappy.LoadSnappy: Snappy native library not loaded
14/02/14 20:47:54 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/14 20:47:54 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/mapred]
14/02/14 20:47:54 INFO streaming.StreamJob: Running job: job_201401161224_3630
14/02/14 20:47:54 INFO streaming.StreamJob: To kill this job, run:
14/02/14 20:47:54 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job   -Dmapred.job.tracker=head:9000 -kill job_201401161224_3630
14/02/14 20:47:54 INFO streaming.StreamJob: Tracking URL: http://mri-head.mri.cs.gmu.edu:50030/jobdetails.jsp?jobid=job_201401161224_3630
14/02/14 20:47:55 INFO streaming.StreamJob:  map 0%  reduce 0%
14/02/14 20:48:26 INFO streaming.StreamJob:  map 100%  reduce 100%
14/02/14 20:48:26 INFO streaming.StreamJob: To kill this job, run:
14/02/14 20:48:26 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job  -Dmapred.job.tracker=head:9000 -kill job_201401161224_3630
14/02/14 20:48:26 INFO streaming.StreamJob: Tracking URL: http://mri-head.mri.cs.gmu.edu:50030/jobdetails.jsp?jobid=job_201401161224_3630
14/02/14 20:48:26 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201401161224_3630_m_000001
14/02/14 20:48:26 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

Can someone help me sort this out? I really need help!

0 Answers:

No answers yet