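Hi, I'm new to MapReduce.
I'm trying to run a simple MapReduce program on AWS using Python. My mapper and reducer code seems to work fine locally, but when I add a step to the cluster to run Hadoop streaming, the job always fails.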
map.py
#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" pair per word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
reduce.py
#!/usr/bin/env python
import sys
from operator import itemgetter

current_word = None
current_count = 0
word = None

# Input arrives sorted by word, so equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
The MapReduce job runs just fine locally.
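For reference, the kind of local check I mean can be sketched in a single file as below. This is only an approximation of the two scripts above: `map_lines`, `reduce_pairs`, `run_local` and the sample input are names and data made up for illustration, and `sorted()` merely stands in for the sort that Hadoop streaming performs between the map and reduce phases.

```python
from itertools import groupby


def map_lines(lines):
    """Emit (word, 1) pairs, mirroring what map.py prints."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_pairs(pairs):
    """Sum the counts per word; expects pairs sorted by word, like reduce.py."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    # sorted() is a stand-in for the shuffle/sort step between map and reduce
    return list(reduce_pairs(sorted(map_lines(lines))))


if __name__ == "__main__":
    sample = ["foo bar foo", "bar baz"]  # hypothetical sample input
    for word, count in run_local(sample):
        print('%s\t%s' % (word, count))
```

A check like this only exercises the counting logic. It says nothing about how the scripts are shipped to or executed on the cluster nodes.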
I'm using the GUI to add a new step to the AWS EMR cluster, and it translates the step's arguments as follows:
hadoop-streaming -files s3://aws-logs-821627436605-us-east-1/map.py,s3://aws-logs-821627436605-us-east-1/reduce.py -mapper map.py -reducer reduce.py -input s3://aws-logs-821627436605-us-east-1/input/ -output s3://aws-logs-821627436605-us-east-1/output/
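To double-check that a step string like this breaks into the options I expect, it can be tokenized roughly as follows. This is a sketch under assumptions: `parse_step` is a helper invented here, `s3://example-bucket` is a placeholder rather than my real bucket, and the pairing logic assumes every option takes exactly one value.

```python
import shlex

# Hypothetical bucket name used for illustration; substitute the real one.
BUCKET = "s3://example-bucket"

step_args = (
    "hadoop-streaming "
    "-files {b}/map.py,{b}/reduce.py "
    "-mapper map.py -reducer reduce.py "
    "-input {b}/input/ -output {b}/output/"
).format(b=BUCKET)


def parse_step(arg_string):
    """Split the step string into tokens and pair each -option with its value."""
    tokens = shlex.split(arg_string)
    command, rest = tokens[0], tokens[1:]
    # Assumes the tokens after the command alternate option, value, option, value
    options = dict(zip(rest[0::2], rest[1::2]))
    return command, options


command, options = parse_step(step_args)

# File names that -files would ship, taken from the end of each S3 path
shipped = [path.rsplit("/", 1)[-1] for path in options["-files"].split(",")]

for role in ("-mapper", "-reducer"):
    status = "shipped" if options[role] in shipped else "NOT shipped"
    print(role, options[role], status)
```

This only confirms that the mapper and reducer names match the files listed under `-files`. It does not verify anything on the cluster itself.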
I'd like to know whether I'm missing something. Thank you.