Running Hadoop streaming on an Amazon AWS EMR cluster

Time: 2019-05-24 15:43:50

Tags: python amazon-web-services hadoop mapreduce amazon-emr

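Hello, I am new to MapReduce.

I am trying to run a simple MapReduce program on AWS using Python. My mapper and reducer code seems to run fine locally, but when I add a step to the cluster to run Hadoop streaming, the job always fails.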

map.py

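#!/usr/bin/env python
"""Mapper: emit one "word<TAB>1" record per word read from stdin."""
import sys

for line in sys.stdin:
    # Split on any whitespace; blank lines yield no words.
    words = line.strip().split()
    for word in words:
        print('%s\t%s' % (word, 1))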

reduce.py

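#!/usr/bin/env python
"""Reducer: sum the counts of consecutive "word<TAB>count" records on stdin."""
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()

    try:
        # Skip records that lack a tab or carry a non-integer count.
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    if current_word == word:
        current_count += count
    else:
        # Input arrives sorted by key, so a new word closes the previous group.
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the last group; print nothing if no valid record was read.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))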

The MapReduce job runs just fine locally.
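For reference, a local run of a streaming job amounts to piping input through the mapper, sorting by key, and piping the result through the reducer. The sketch below imitates that pipeline in a single Python process. The function names (`map_lines`, `reduce_sorted`, `run_local`) are made up for illustration and are not part of the scripts above, and the in-memory sort only stands in for Hadoop's shuffle phase.

```python
from itertools import groupby


def record_key(record):
    """Return the key part of a 'key<TAB>value' record."""
    return record.split('\t', 1)[0]


def map_lines(lines):
    """Emit one 'word<TAB>1' record per word, mirroring map.py."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


def reduce_sorted(records):
    """Sum counts over runs of records that share a key, mirroring reduce.py."""
    for word, group in groupby(records, key=record_key):
        total = sum(int(record.split('\t', 1)[1]) for record in group)
        yield '%s\t%s' % (word, total)


def run_local(lines):
    """Map, sort by key (a stand-in for the shuffle), then reduce."""
    return list(reduce_sorted(sorted(map_lines(lines), key=record_key)))


if __name__ == '__main__':
    for record in run_local(['a b a', 'b c']):
        print(record)
```

A local run like this says little about the cluster environment, for example which Python interpreter the `#!/usr/bin/env python` line resolves to on the EMR nodes.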

I am using the GUI to add a new step to the AWS EMR cluster, and it translates my arguments as follows:

hadoop-streaming -files s3://aws-logs-821627436605-us-east-1/map.py,s3://aws-logs-821627436605-us-east-1/reduce.py -mapper map.py -reducer reduce.py -input s3://aws-logs-821627436605-us-east-1/input/ -output s3://aws-logs-821627436605-us-east-1/output/
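The same step can also be described programmatically, which makes the individual arguments easier to inspect. The following is only a sketch: it assumes the third-party `boto3` package, the `us-east-1` region (guessed from the bucket name), and a step name invented for illustration. The helper names `build_streaming_step` and `submit_step` are hypothetical, and the cluster ID has to be supplied by the caller.

```python
def build_streaming_step(bucket):
    """Build an EMR step definition that mirrors the command above.

    `bucket` is the S3 prefix holding the scripts and the input data,
    e.g. 's3://aws-logs-821627436605-us-east-1'.
    """
    return {
        'Name': 'word-count-streaming',  # hypothetical step name
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            # command-runner.jar lets an EMR step run a command such as hadoop-streaming.
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', '%s/map.py,%s/reduce.py' % (bucket, bucket),
                '-mapper', 'map.py',
                '-reducer', 'reduce.py',
                '-input', '%s/input/' % bucket,
                '-output', '%s/output/' % bucket,
            ],
        },
    }


def submit_step(cluster_id, step, region='us-east-1'):
    """Submit the step to a running cluster and return the new step IDs.

    The region default is an assumption; boto3 is imported lazily so that
    the step definition can be built and inspected without it installed.
    """
    import boto3

    client = boto3.client('emr', region_name=region)
    response = client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response['StepIds']
```

Calling `build_streaming_step('s3://aws-logs-821627436605-us-east-1')` and printing the result shows exactly which arguments reach `hadoop-streaming`. When a step fails, comparing that list against the step's stderr log is one way to narrow down the cause.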

I would like to know whether I am missing something. Thank you.

0 answers:

No answers