Question

I'm using MrJob to run a Hadoop job on Elastic MapReduce, and it keeps crashing at random points.

The data looks like this (tab-separated):

279391888       261151291       107.303163      35.468534
279391888       261115099       108.511726      35.503008
279391888       261151290       104.881560      35.278487
279391888       261151292       109.732004      35.659141
279391888       261266862       108.507754      35.434581
279391888       1687590146      59.118796       19.931201
279391888       269450882       58.909985       19.914108

The underlying MapReduce is very simple:

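from mrjob.job import MRJob
import numpy as np

class CitypathsSummarize(MRJob):
  def mapper(self, _, line):
    # Each input line holds: origin id, destination id, minutes, distance
    orig, dest, minutes, dist = line.split()
    minutes = float(minutes)
    dist = float(dist)
    if minutes < .001:
      # Near-zero travel time: count it under a catch-all key instead of dividing
      yield "crap", 1
    else:
      # Emit the speed (distance per minute) keyed by origin
      yield orig, dist/minutes

  def reducer(self, orig, speeds):
    # Average all speeds seen for this origin
    speeds = list(speeds)
    mean = np.mean(speeds)
    yield orig, mean

if __name__ == "__main__":
  CitypathsSummarize.run()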
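
To sanity-check the map and reduce logic separately from EMR, the same steps can be sketched in plain Python with no Hadoop, mrjob, or numpy involved. This is only an illustrative sketch: `map_line`, `run_locally`, and `SAMPLE` are hypothetical names that are not part of the job above, and the sample rows are copied from the data excerpt.

```python
from collections import defaultdict

# A few rows copied from the sample data (tab-separated).
SAMPLE = [
    "279391888\t261151291\t107.303163\t35.468534",
    "279391888\t261115099\t108.511726\t35.503008",
    "279391888\t1687590146\t59.118796\t19.931201",
]


def map_line(line):
    """Mirror the mapper: return one (key, value) pair for an input line."""
    orig, dest, minutes, dist = line.split()
    minutes = float(minutes)
    dist = float(dist)
    if minutes < .001:
        return "crap", 1
    return orig, dist / minutes


def run_locally(lines):
    """Group mapper output by key, then average each group like the reducer."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


if __name__ == "__main__":
    for key, mean in sorted(run_locally(SAMPLE).items()):
        print("%s\t%s" % (key, mean))
```

If this produces sensible means on a downloaded input file, the logic of the job itself is probably not what is failing on the cluster.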

I run it with the following command, using the default mrjob.conf (my keys are set in the environment):

$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt

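When I run it on a small dataset, it finishes fine. When I run it on the entire data corpus (about 10 GiB worth), I get errors like this (but not at the same point each time!):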

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
  File "summarize.py", line 32, in <module>
    CitypathsSummarize.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)

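I've run it twice; the first time it died after 45 minutes, and this time it died after four hours. Both times it died on a different file. I've checked both files it died on, and neither has any problems.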
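
How the files were checked is not shown. One way to check a downloaded input file for lines the mapper could not parse might look like the sketch below. The function name `find_malformed_lines` is hypothetical, and the only assumptions are the ones the mapper already makes: four whitespace-separated fields, with the third and fourth parseable as floats.

```python
def find_malformed_lines(lines):
    """Return (line_number, reason) pairs for lines the mapper would reject."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 4:
            problems.append((number, "expected 4 fields, got %d" % len(fields)))
            continue
        try:
            # The mapper converts minutes and distance with float()
            float(fields[2])
            float(fields[3])
        except ValueError:
            problems.append((number, "non-numeric minutes or distance"))
    return problems
```

For a local copy of one input file, `find_malformed_lines(open(path))` returns an empty list when every line parses.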

Somehow it can't find the spill files it wrote, which confuses me.

Edit:

I ran the job again, and it died again after a few hours, this time with a different error message.

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)
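
For context on this second error: S3 returns RequestTimeTooSkewed when the timestamp on a request is too far from S3's own clock. The sketch below only illustrates that comparison. The 15-minute tolerance is an assumption taken from S3's documented behaviour, and `is_too_skewed` and `MAX_SKEW` are hypothetical names, not part of mrjob, Hadoop, or any AWS library.

```python
from datetime import datetime, timedelta, timezone

# Assumption: S3 tolerates up to 15 minutes between request time and server time.
MAX_SKEW = timedelta(minutes=15)


def is_too_skewed(request_time, server_time, max_skew=MAX_SKEW):
    """Return True if the two timestamps differ by more than max_skew."""
    return abs(server_time - request_time) > max_skew


if __name__ == "__main__":
    server = datetime(2013, 1, 31, 9, 0, tzinfo=timezone.utc)
    # A node whose clock has drifted 20 minutes behind would be rejected.
    print(is_too_skewed(server - timedelta(minutes=20), server))
```

Under that assumption, an error like this would point at the clock on one of the cluster nodes rather than at the input file named in the log.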

Random java.io.FileNotFoundException jobcache errors on EMR using MrJob

0 answers: