Can any algorithm be implemented as a Hadoop Streaming MapReduce job?

Asked: 2014-05-14 00:53:01

Tags: python algorithm hadoop mapreduce streaming

I have a dataset:

user feature1 feature2 feature3 feature4 ...

user1 f11 f12 f13 f14 ...

user2 f21 f22 f23 f24 ...

I have an algorithm to apply to this dataset, so that for each user we can compute a similarity score between that user and every other user:

      score{user_i}=algorithm(dict{user_i},dict{user_k})

dict{user_i} = [f11, f12, f13, f14] is a hash.

For each user, after computing the similarity between that user and all other users, we sort the similarity scores in descending order and emit the result.
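For concreteness, here is a hypothetical picture of the per-user structure the reducer builds in memory (the values are made up for illustration):

# hypothetical illustration: one feature list per user, keyed by user id
features = {
    "user1": ["f11", "f12", "f13", "f14"],
    "user2": ["f21", "f22", "f23", "f24"],
}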

Here is reducer.py:

#!/usr/bin/env python

import sys

def similarity(list1, list2):
    # list3 marks whether each feature matched; list4 marks that it was compared
    list3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    list4 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    if len(list1) >= 5:
        if list1[3] == list2[3]:
            list3[3] = 1
            list4[3] = 1
        else:
            list3[3] = 0
            list4[3] = 1
        if list1[4] == list2[4]:
            list3[4] = 1
            list4[4] = 1
        else:
            list3[4] = 0
            list4[4] = 1
        if list1[5] == list2[5]:
            list3[5] = 1
            list4[5] = 1
        else:
            list3[5] = 0
            list4[5] = 1
        # feature 6 is a date; compare the 4-digit year ("\N" is Hive's NULL marker)
        if list1[6] != "\N" and list2[6] != "\N" and abs(float(list1[6].split("/")[2][0:4]) - float(list2[6].split("/")[2][0:4])) <= 2:
            list3[6] = 1
            list4[6] = 1
        else:
            list3[6] = 0
            list4[6] = 1
        # second operand is list2[7]; the original compared list1[7] twice, apparently a typo
        if list1[7] != "\N" and list2[7] != "\N" and abs(float(list1[7]) - float(list2[7])) <= 20:
            list3[7] = 1
            list4[7] = 1
        else:
            list3[7] = 0
            list4[7] = 1
        list3[8] = 1
        list4[8] = 1
        if list1[9] != "\N" and list2[9] != "\N" and list1[9] != "" and list2[9] != "" and abs(float(list1[9]) - float(list2[9])) <= 20:
            list3[9] = 1
            list4[9] = 1
        else:
            list3[9] = 0
            list4[9] = 1
        if list1[10] != "\N" and list2[10] != "\N" and list1[10] != 0 and list2[10] != 0 and abs(float(list1[10]) - float(list2[10])) <= 3:
            list3[10] = 1
            list4[10] = 1
        else:
            list3[10] = 0
            list4[10] = 1
        # Jaccard-style ratio over the compared feature flags
        set_1 = list3[3:11]
        set_2 = list4[3:11]
        inter_len = 0
        noninter_len = 0
        for i in range(len(set_1)):
            if set_1[i] == set_2[i]:
                inter_len = inter_len + 1
            if set_1[i] != set_2[i]:
                noninter_len = noninter_len + 1
        jaccard = inter_len / float(inter_len + noninter_len)
        if list1[0] == list2[0]:
            genre = 1
        elif list1[0][0:6] == list2[0][0:6]:
            genre = 0.5
        else:
            genre = 0
        if list1[1] == list2[1]:
            rating = 1
        elif list1[1][0:2] == list2[1][0:2]:
            rating = 0.5
        else:
            rating = 0
        if list1[2] != "" and list2[2] != "" and len(set.intersection(set(list1[2].split(",")), set(list2[2].split(",")))) > 0:
            target = 1
        else:
            target = 0
        return jaccard + genre + rating + target
    else:
        print "Trim data incomplete"


trim_id = sys.argv[0]  # note: sys.argv[0] is the script path, not a user id

dict = {}   # shadows the builtin dict; kept from the original
score = {}

for line in sys.stdin:
    line = line.strip().split("\t")
    dict[line[0]] = line[1:12]

keylist = dict.keys()
keylist.sort()

for key in keylist:
    if key != trim_id:
        score[key] = similarity(dict[key], dict[trim_id])

# emit the ten most similar users, highest score first
iter = 0
for key, value in sorted(score.iteritems(), key=lambda (k, v): (v, k), reverse=True):
    print "%s" % (key)
    iter = iter + 1
    if iter >= 10:
        break

Here is the bash script for the Hadoop Streaming job:

hadoop fs -rmr /tmp/somec/some/

hadoop jar *.jar \
       -input /user/hive/warehouse/fb_text/ \
       -output /tmp/somec/some/ \
       -mapper "cat" \
       -reducer "jac.py" \
       -file jac.py
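(Since -mapper "cat" just passes every input line through unchanged, an equivalent identity mapper in Python, shown here only as a minimal sketch, would be:)

#!/usr/bin/env python
# identity mapper: a streaming mapper reads lines on stdin and writes them to stdout
import sys

for line in sys.stdin:
    sys.stdout.write(line)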

fb_text is tab-delimited, and that part works fine. I tested a word-count Hadoop Streaming job and it ran smoothly.

Here is the Hadoop Streaming error:

rmr: DEPRECATED: Please use 'rm -r' instead.
14/05/14 00:31:55 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/somec/some
14/05/14 00:31:57 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [jac.py] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob3048667246321733915.jar tmpDir=null
14/05/14 00:31:58 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-190.us-west-2.compute.internal/10.0.0.190:8032
14/05/14 00:31:59 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-190.us-west-2.compute.internal/10.0.0.190:8032
14/05/14 00:32:02 INFO mapred.FileInputFormat: Total input paths to process : 1
14/05/14 00:32:04 INFO mapreduce.JobSubmitter: number of splits:2
14/05/14 00:32:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399599059169_0110
14/05/14 00:32:05 INFO impl.YarnClientImpl: Submitted application application_1399599059169_0110
14/05/14 00:32:05 INFO mapreduce.Job: The url to track the job: http://ip-10-0-0-190.us-west-2.compute.internal:8088/proxy/application_1399599059169_0110/
14/05/14 00:32:05 INFO mapreduce.Job: Running job: job_1399599059169_0110
14/05/14 00:32:13 INFO mapreduce.Job: Job job_1399599059169_0110 running in uber mode : false
14/05/14 00:32:13 INFO mapreduce.Job:  map 0% reduce 0%
14/05/14 00:32:19 INFO mapreduce.Job:  map 50% reduce 0%
14/05/14 00:32:20 INFO mapreduce.Job:  map 100% reduce 0%
14/05/14 00:32:26 INFO mapreduce.Job: Task Id : attempt_1399599059169_0110_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
        at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)

14/05/14 00:32:26 INFO mapreduce.Job: Task Id : attempt_1399599059169_0110_r_000003_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
        at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:160)

I would like to know why.

My Hadoop Streaming jar is fine; I tested a word-count example with it and it ran smoothly.

The Python code runs fine on my local Linux machine.

1 Answer:

Answer 0 (score: 1):

You are only seeing half of the error on screen. It basically says "the Python script failed".

You need to go to the job tracker UI, find your job, click into the failed task, and look at the logs. Hopefully Python wrote something to stderr that will help you.

For extra debugging, consider adding some helpful "println"-style messages to your script.
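In a streaming job stdout is the data channel between stages, so debug messages need to go to stderr. A minimal sketch of what that could look like at the top of jac.py:

# sketch: write debug messages to stderr so they land in the task logs,
# not in the reducer's data stream on stdout
import sys

def debug(msg):
    sys.stderr.write("DEBUG: %s\n" % msg)

debug("reducer started, argv=%s" % sys.argv)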

A good way to test locally is not just to run the Python script by itself, but to run it the same way Streaming will. Try:

    cat data | map.py | sort | reduce.py

Finally: the mapper and reducer output should be \t-delimited (i.e., key and value separated by a tab).
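For example (the key and value here are hypothetical):

# one output record: key, a tab, then the value
user_id, score = "user1", 3.5
print "%s\t%s" % (user_id, score)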