Question

I have set up a streaming boto job flow on AWS EMR that runs fine with the familiar test pipeline:

 sed -n '0~10000p'  Big.csv | ./map.py | sort -t$'\t' -k1 | ./reduce.py

As I increase the size of the input data, the boto EMR job also runs fine up to some threshold, beyond which the job fails with a Python broken pipe error:

 Traceback (most recent call last):
   File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201504151813_0001/attempt_201504151813_0001_r_000002_0/work/./reduce.py", line 18, in <module>
json.dump( { "cid":cur_key , "promo_hx":kc } , sys.stdout ) 
   File "/usr/lib/python2.6/json/__init__.py", line 181, in dump
fp.write(chunk)
 IOError: [Errno 32] Broken pipe

along with the following Java error:

  org.apache.hadoop.streaming.PipeMapRed (Thread-38): java.lang.OutOfMemoryError: Java heap space

I assume the out-of-memory error happens first and causes the broken pipe.

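The map tasks complete for any input data size; the error occurs in the reducer stage. My reducer is the usual streaming reducer (I am using AMI 3.2.3 and the json package built into Python 2.6.9):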

 for line in sys.stdin:
      line                = line.strip()
      key  , value        = line.split('\t')
      ...
      print json.dumps( { "cid":cur_key , "promo_hx":kc } , sort_keys=True , separators=(',',': ') )

Any idea what is going on? Thanks.

Answer 1

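It looks like you need to increase the reducer memory size. This can be done either by choosing a different instance type (see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html for the defaults per instance type) or by tuning the mapreduce.reduce.* properties at the job level or the cluster level (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop).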
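As a rough sketch of the job-level option, the snippet below builds the Hadoop generic `-D` arguments that raise the reducer container size and JVM heap. The helper name `reducer_memory_args`, the 4096 MB figure, and the 0.8 heap fraction are illustrative assumptions, not values taken from the question; the right numbers depend on the instance type.

```python
def reducer_memory_args(container_mb, heap_fraction=0.8):
    """Build Hadoop generic -D options that raise reducer memory.

    container_mb and heap_fraction are illustrative assumptions; pick
    values that fit the memory of the chosen instance type.
    """
    # Keep the JVM heap below the container size to leave headroom for
    # non-heap memory.
    heap_mb = int(container_mb * heap_fraction)
    return [
        "-D", "mapreduce.reduce.memory.mb=%d" % container_mb,
        "-D", "mapreduce.reduce.java.opts=-Xmx%dm" % heap_mb,
    ]


step_args = reducer_memory_args(4096)
print(" ".join(step_args))
```

With boto, a list like this would typically be passed as the `step_args` argument of the streaming step so that it reaches the streaming jar before the mapper and reducer options. Note that the Python reducer process itself runs outside the JVM heap, so the container needs room for both.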

boto EMR job error: Python broken pipe error and java.lang.OutOfMemoryError
