Hadoop流式调用来调用python脚本

时间:2013-04-29 10:37:16

标签: python python-2.7 hadoop hadoop-streaming

我有两个小的python脚本

CountWordOccurence_mapper.py

#!/usr/bin/env python
import sys

#print(sys.argv[1])

text = sys.argv[1]

wordCount = text.count(sys.argv[2])

#print (sys.argv[2],wordCount)
print '%s\t%s' % (sys.argv[2], wordCount)

PrintWordCount_reducer.py

#!/usr/bin/env python
import sys

finalCount = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')

    count=int(count)

    finalCount += count

    print(word,finalCount)

我执行了以下相同的操作:

$ ./CountWordOccurence_mapper.py \
    "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
    "Accord" \
     | /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)

如上所见,我的目标是愚蠢的 - 算不上。在给定文本中出现所提供的单词(在本例中为Accord)。

现在,我打算使用Hadoop流执行相同的操作。 HDFS(部分)上的文本文件是:

"message" : "I am a Honda customer 100%.. 94 Accord ex   96 Accord exV6  98 Accord exv6 cpe  2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower   motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it.  Yesterday   I tried to start it unsuccessfully.  After hours at the auto mechanics  it was found that there was a glitch in the electric/computer system.  The news was disappointing enough (and expensive)    but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful.  When I bought a NEW Honda I thought I bought quality.  I was wrong! Will Honda step up?"

我修改了CountWordOccurence_mapper.py

#!/usr/bin/env python
import sys

for text in sys.stdin:  

    wordCount = text.count(sys.argv[1])

    print '%s\t%s' % (sys.argv[1], wordCount)

我的第一个困惑是 - 如何发送要计算的单词,例如“Accord”,“Honda”作为映射器的参数(-cmdenv name = value)只会让我困惑。我仍然继续执行以下命令:

$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
 -input /random_data/honda_service_within_warranty.txt \
 -output /random_op/cnt.txt \
 -file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
 -mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
 -file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
 -reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py

正如预期的那样,作业失败了,我收到了以下错误:

Traceback (most recent call last):
  File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
    wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

请纠正我所犯的语法和基本错误。

谢谢和问候!

2 个答案:

答案 0 :(得分:1)

我认为您的问题在于命令行调用的以下部分:

-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"

我认为你的假设是字符串“Accord”作为第一个参数传递给mapper。我很确定情况并非如此,实际上字符串“Accord”很可能被流式驱动程序入口点类(StreamJob.java)忽略。

要解决这个问题,你需要回到使用-cmdenv参数然后在你的python代码中提取这个键/值对(我不是一个python程序员,但我确定一个快速的谷歌会指出你需要的片段。

答案 1 :(得分:0)

实际上,我曾尝试使用-cmdenv wordToBeCounted =“Accord”但问题出在我的python文件中 - 我忘了对它进行更改以从环境变量中读取值“Accord”(而不是从参数数组中读取) )。 我正在附加CountWordOccurence_mapper.py的代码,以防万一有人想用它作为参考:

#!/usr/bin/env python
import sys
import os

wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:  

    wordCount = text.count(wordToBeCounted)

    print '%s\t%s' % (wordToBeCounted,wordCount)

谢谢和问候!