I have to join 6 sets of data concerning the view counts of certain TV shows on various channels. 3 of the 6 sets contain a list of shows and the number of views for each show, e.g.:
Show_Name 201
Another_Show 105
and so on...
The other 3 sets contain the shows and the channel each show airs on, e.g.:
Show_Name ABC
Another_Show CNN
and so on...
I wrote the following mapper in Python to find the shows that air on channel ABC:
#!/usr/bin/env python
import sys

all_shows_views = []
shows_on_ABC = []

for line in sys.stdin:
    line = line.strip()           # strip out carriage return (i.e. removes line breaks)
    key_value = line.split(",")   # split line into key and value, returns a list
    key_in = key_value[0]         # don't need a further split(" ") b/c there is no date
    value_in = key_value[1]       # value is 2nd item
    if value_in.isdigit():
        show = key_in
        all_shows_views.append(show + "\t" + value_in)
    if value_in == "ABC":         # check if the TV show airs on ABC
        show = key_in
        shows_on_ABC.append(show)

for i in range(len(all_shows_views)):
    show_view = all_shows_views[i].split("\t")
    for c in range(len(shows_on_ABC)):
        if show_view[0] == shows_on_ABC[c]:
            print(show_view[0] + "\t" + show_view[1])

# Note that Hadoop expects a tab to separate key and value,
# but this program assumes the input file has a ',' separating key and value.
The mapper only passes through the show names that air on ABC and their view counts, e.g.:
Show_name_on_ABC 120
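As an aside, the same replicated join can be expressed with a set lookup instead of the nested scan at the end. The following is only an illustrative sketch under the same comma-separated input assumption, not the mapper whose output is shown above:

#!/usr/bin/env python
import sys

views = []        # (show, view_count) records seen on stdin
on_abc = set()    # shows known to air on ABC

for line in sys.stdin:
    key_in, value_in = line.strip().split(",", 1)
    if value_in.isdigit():
        views.append((key_in, value_in))
    elif value_in == "ABC":
        on_abc.add(key_in)

# Emit only the view records whose show airs on ABC.
for show, count in views:
    if show in on_abc:
        print(show + "\t" + count)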
The reducer, also in Python, is as follows:
prev_show = " " #initialize previous word to blank string
line_cnt = 0 #count input lines.
count = 0 #keep running total.
for line in sys.stdin:
line = line.strip() #strip out carriage return
key_value = line.split('\t') #split line, into key and value, returns a list
line_cnt = line_cnt+1
curr_show = key_value[0] #key is first item in list, indexed by 0
value_in = key_value[1] #value is 2nd item
if curr_show != prev_show and line_cnt>1:
#print "\n"
#print "---------------------Total---------------------"
#print "\n"
print (prev_show + "\t" + str(count))
#print "\n"
#print "------------------End of Item------------------"
#print "\n"
count = 0
else:
count = count + int(key_value[1])
#print key_value[0] + "\t" + key_value[1]
prev_show = curr_show #set up previous show for the next set of input lines.
print (curr_show + "\t" + str(count))
The reducer consumes the list of shows on ABC and their view counts, keeps a running count per show, and prints out the total for each show (Hadoop automatically sorts the data alphabetically by key, which in this case is the show name).
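For local testing, the same per-show totalling can also be sketched with itertools.groupby; this is an illustrative alternative, not the reducer used here:

#!/usr/bin/env python
import sys
from itertools import groupby

def parse(stream):
    # Yield (show, views) pairs from tab-separated "show<TAB>views" lines.
    for line in stream:
        show, views = line.strip().split("\t")
        yield show, int(views)

# groupby only merges adjacent keys, so it relies on the input already being
# sorted by show name, which the shuffle phase (or `sort` in the pipe) guarantees.
for show, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    total = sum(views for _, views in group)
    print(show + "\t" + str(total))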
When I run this in the terminal with a piped command, as follows:
cat Data*.text | /home/cloudera/mapper.py | sort | /home/cloudera/reducer.py
I get a neat output with the correct totals, like so:
Almost_Games 49237
Almost_News 45589
Almost_Show 49186
Baked_Games 50603
But when I run this with the Hadoop command in the terminal, as follows:
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper /home/cloudera/mapper.py \
-reducer /home/cloudera/reducer.py
I get an unsuccessful job error, with the reducer as the culprit. The full error is below:
15/11/15 09:16:54 INFO mapreduce.Job: Job job_1447598349691_0003 failed with state FAILED due to: Task failed task_1447598349691_0003_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/11/15 09:16:54 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=674742
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=113784
HDFS: Number of bytes written=0
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=6
Launched reduce tasks=4
Data-local map tasks=6
Total time spent by all maps in occupied slots (ms)=53496
Total time spent by all reduces in occupied slots (ms)=18565
Total time spent by all map tasks (ms)=53496
Total time spent by all reduce tasks (ms)=18565
Total vcore-seconds taken by all map tasks=53496
Total vcore-seconds taken by all reduce tasks=18565
Total megabyte-seconds taken by all map tasks=54779904
Total megabyte-seconds taken by all reduce tasks=19010560
Map-Reduce Framework
Map input records=6600
Map output records=0
Map output bytes=0
Map output materialized bytes=36
Input split bytes=729
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=452
CPU time spent (ms)=4470
Physical memory (bytes) snapshot=1628909568
Virtual memory (bytes) snapshot=9392836608
Total committed heap usage (bytes)=1279262720
File Input Format Counters
Bytes Read=113055
15/11/15 09:16:54 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Why does the piped command work, but not the Hadoop execution?
Answer 0 (score: 0)
It looks like you are not using the Hadoop streaming command correctly. Instead of
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper /home/cloudera/mapper.py \
-reducer /home/cloudera/reducer.py
you need to provide the mapper command to the -mapper option. Try
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-file /home/cloudera/mapper.py \
-file /home/cloudera/reducer.py
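One more thing worth checking, as an assumption rather than something the error output confirms: Hadoop streaming launches the scripts as standalone programs, so both files generally need to be executable (chmod +x /home/cloudera/mapper.py /home/cloudera/reducer.py) and need the #!/usr/bin/env python shebang line intact.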
Also check the error logs by opening any of the failed tasks from the tracking URL, because the log above is not very helpful.
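If the tracking URL is not handy, the same task logs can often be pulled from the command line, assuming YARN log aggregation is enabled, with something like yarn logs -applicationId application_1447598349691_0003 (the application id here is inferred from the job id in the output above).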
Answer 1 (score: 0)
The reducer Python script was generating the error because the variable curr_show was only declared inside the for loop that reads the lines. The reason the error only occurred when using the Hadoop command and not the piped command is due to Hadoop's inner workings, which I am very unfamiliar with. By declaring the curr_show variable outside the for loop, the final print statement is able to execute.
prev_show = " " #initialize previous word to blank string
line_cnt = 0 #count input lines.
count = 0 #keep running total.
curr_show = " "
for line in sys.stdin:
line = line.strip() #strip out carriage return
key_value = line.split('\t') #split line, into key and value, returns a list
line_cnt = line_cnt+1
curr_show = key_value[0] #key is first item in list, indexed by 0
value_in = key_value[1] #value is 2nd item
if curr_show != prev_show and line_cnt>1:
#print "\n"
#print "---------------------Total---------------------"
#print "\n"
print (prev_show + "\t" + str(count))
#print "\n"
#print "------------------End of Item------------------"
#print "\n"
count = int(value_in)
else:
count = count + int(key_value[1])
#print key_value[0] + "\t" + key_value[1]
prev_show = curr_show #set up previous show for the next set of input lines.
print (curr_show + "\t" + str(count))
In addition, the count variable was changed to reset to the current value_in, so that the value on the line where the show changes is not lost.
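A short made-up trace shows why the reset matters. With the sorted input below, the old count = 0 would have reported Show_B as 0:

Show_A 10     ->  count = 10
Show_A 20     ->  count = 30
Show_B 5      ->  prints "Show_A 30", count is seeded with 5
end of input  ->  final print emits "Show_B 5"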
Answer 2 (score: -1)
This mapper and reducer still don't work for me. I get the exception below. Do you see the problem?
The command used for this is:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input_join -output /user/cloudera/output_join2 -mapper '/home/cloudera/join2.mapper.py' -reducer '/home/cloudera/join2.reducer.py'
Error log:
FATAL [IPC Server handler 5 on 51645] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1449644802746_0003_m_000001_0 - exited : java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)