使用MrJob在EMR上发生随机java.io.FileNotFoundException jobcache错误

时间:2013-01-22 03:16:01

标签: python hadoop emr mrjob

我正在使用MrJob并尝试在Elastic Map Reduce上运行Hadoop作业,这会随机崩溃。

数据看起来像这样(制表符分隔):

279391888       261151291       107.303163      35.468534
279391888       261115099       108.511726      35.503008
279391888       261151290       104.881560      35.278487
279391888       261151292       109.732004      35.659141
279391888       261266862       108.507754      35.434581
279391888       1687590146      59.118796       19.931201
279391888       269450882       58.909985       19.914108

基础MapReduce非常简单:

from mrjob.job import MRJob
import numpy as np

class CitypathsSummarize(MRJob):
  def mapper(self, _, line):
    orig, dest, minutes, dist = line.split()
    minutes = float(minutes)
    dist = float(dist)
    if minutes < .001:
      yield "crap", 1
    else:
      yield orig, dist/minutes

  def reducer(self, orig, speeds):
    speeds = list(speeds)
    mean = np.mean(speeds)
    yield orig, mean

if __name__ == "__main__":
  CitypathsSummarize.run()

当我运行它时,我使用以下命令,使用默认的mrjob.conf(我的密钥在环境中设置):

$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt 

当我在小数据集上运行它时,它就完成了。当我在整个数据语料库(约10GiB值)上运行它时,我得到这样的错误(但不是每次都在同一点!):

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
  File "summarize.py", line 32, in <module>
    CitypathsSummarize.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)

我跑了两次;它在45分钟后第一次死亡,这次它在四小时后死亡。两次都死在不同的文件上。我已经检查了它死亡的两个文件,但没有任何问题。

不知何故,它找不到它写的溢出文件,这让我感到困惑。

编辑:

我再次运行这个工作,几个小时后它再次死亡,这次是一个不同的错误信息。

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)

0 个答案:

没有答案