I am trying to solve the inverted word list problem with Hadoop Streaming (for each word, the output is the list of file names that contain that word). The input is the name of a directory containing text files. I have written the mapper and reducer in Python, and they work correctly when I chain them with Unix pipes. However, when I run them through the Hadoop Streaming command, the code runs but the job ultimately fails. I suspect the problem is somewhere in the mapper code, but I cannot pin down exactly what it is.
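As a concrete (made-up) example of the expected output: if the word "crown" appears in hamlet.txt and in richardiii.txt, the final output should contain a line like

    crown	['hamlet', 'richardiii']

since the reducer prints, for each word, the de-duplicated list of file names it appeared in.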
I am a beginner (so please bear with me if I am doing something badly) and am using the Cloudera training VM on VMware Fusion. My mapper and reducer .py executables sit in my home directory on both the local file system and HDFS. I have the directory "shakespeare" on HDFS. The Unix pipe command below works fine:
echo shakespeare | ./InvertedMapper.py | sort | ./InvertedReducer.py

However, the Hadoop Streaming command does not:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
    -input shakespeare \
    -output InvertedList \
    -mapper InvertedMapper.py \
    -reducer InvertedReducer.py \
    -file InvertedMapper.py \
    -file InvertedReducer.py
#MAPPER CODE
#!/usr/bin/env python

import sys
import os


class Mapper(object):

    def __init__(self, stream, sep='\t'):
        self.stream = stream
        self.sep = sep

    def __iter__(self):
        # Read all of stdin as a directory name, move into that directory,
        # and yield the absolute path of every file it contains.
        os.chdir(self.stream.read().strip())
        files = [os.path.abspath(f) for f in os.listdir(".")]
        for file in files:
            yield file

    def emit(self, key, value):
        sys.stdout.write("{0}{1}{2}\n".format(key, self.sep, value))

    def map(self):
        # Emit a (word, file name) pair for every word in every file.
        for file in self:
            with open(file) as infile:
                name = file.split("/")[-1].split(".")[0]
                words = infile.read().strip().split()
                for word in words:
                    self.emit(word, name)


if __name__ == "__main__":
    cwd = os.getcwd()
    mapper = Mapper(sys.stdin)
    mapper.map()
    os.chdir(cwd)
#REDUCER CODE
#!/usr/bin/env python

import sys
from itertools import groupby
from operator import itemgetter


class Reducer(object):

    def __init__(self, stream, sep="\t"):
        self.stream = stream
        self.sep = sep

    def __iter__(self):
        # Parse each input line into a (key, value) pair; skip malformed lines.
        for line in self.stream:
            try:
                parts = line.strip().split(self.sep)
                yield parts[0], parts[1]
            except:
                continue

    def emit(self, key, value):
        sys.stdout.write("{0}{1}{2}\n".format(key, self.sep, value))

    def reduce(self):
        # Group the sorted (word, file name) pairs by word and emit the
        # de-duplicated list of file names for each word.
        for key, group in groupby(self, itemgetter(0)):
            values = []
            for item in group:
                values.append(item[1])
            values = set(values)
            values = list(values)
            self.emit(key, values)


if __name__ == "__main__":
    reducer = Reducer(sys.stdin)
    reducer.reduce()
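For reference, the reducer can be exercised on its own by piping it pre-sorted key/value lines (the words and file names below are made-up examples):

    printf 'crown\thamlet\ncrown\trichardiii\nsword\tmacbeth\n' | ./InvertedReducer.py

This should print each word followed by the list of file names it was seen in.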
The output from running the Hadoop command is shown below.
packageJobJar: [InvertedMapper1.py, /tmp/hadoop-training/hadoop-unjar281431668511629942/] [] /tmp/streamjob679048425003800890.jar tmpDir=null
19/02/17 00:22:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
19/02/17 00:22:19 INFO mapred.FileInputFormat: Total input paths to process : 5
19/02/17 00:22:20 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
19/02/17 00:22:20 INFO streaming.StreamJob: Running job: job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:22:20 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:22:20 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:22:21 INFO streaming.StreamJob: map 0% reduce 0%
19/02/17 00:22:34 INFO streaming.StreamJob: map 40% reduce 0%
19/02/17 00:22:39 INFO streaming.StreamJob: map 0% reduce 0%
19/02/17 00:22:50 INFO streaming.StreamJob: map 40% reduce 0%
19/02/17 00:22:53 INFO streaming.StreamJob: map 0% reduce 0%
19/02/17 00:23:03 INFO streaming.StreamJob: map 40% reduce 0%
19/02/17 00:23:06 INFO streaming.StreamJob: map 20% reduce 0%
19/02/17 00:23:07 INFO streaming.StreamJob: map 0% reduce 0%
19/02/17 00:23:16 INFO streaming.StreamJob: map 20% reduce 0%
19/02/17 00:23:17 INFO streaming.StreamJob: map 40% reduce 0%
19/02/17 00:23:19 INFO streaming.StreamJob: map 20% reduce 0%
19/02/17 00:23:21 INFO streaming.StreamJob: map 100% reduce 100%
19/02/17 00:23:21 INFO streaming.StreamJob: To kill this job, run:
19/02/17 00:23:21 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201902041621_0051
19/02/17 00:23:21 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201902041621_0051
19/02/17 00:23:21 ERROR streaming.StreamJob: Job not successful. Error: NA
19/02/17 00:23:21 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Answer 0 (score: 0)
I don't know whether this is why your code is failing, but the FAQ states that Unix pipes should not be used with Hadoop Streaming.
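Beyond that, note what the mapper above assumes about its input: the echo test feeds it exactly one directory name on stdin, but under Hadoop Streaming a map task's stdin receives the contents of the input files, one line of text at a time, so os.chdir() will be handed a line of Shakespeare rather than a directory name and the task will fail. A minimal sketch of a Streaming-style mapper for the same inverted-list problem is below. It assumes the current input file's name can be read from the environment variable Hadoop Streaming sets for each map task (map_input_file on older releases such as 0.20, mapreduce_map_input_file on newer ones); the fallback to "unknown" is only there so the script still runs outside Hadoop.

    #!/usr/bin/env python
    # Sketch: read file *contents* line by line from stdin and take the name of
    # the file currently being processed from the task's environment.
    import os
    import sys

    # Which variable is set depends on the Hadoop version; fall back to a
    # placeholder so the script can also be tested outside Hadoop.
    path = os.environ.get("map_input_file") or os.environ.get("mapreduce_map_input_file", "unknown")
    name = path.split("/")[-1].split(".")[0]

    for line in sys.stdin:
        for word in line.strip().split():
            sys.stdout.write("{0}\t{1}\n".format(word, name))

With a mapper shaped like this the reducer above can stay as it is, and a local test becomes cat shakespeare/* | ./InvertedMapper.py | sort | ./InvertedReducer.py (run against a local copy of the files; the file name will simply come out as "unknown" outside a real map task, since the environment variable is only set by Hadoop).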