Mapreduce为大输入文件抛出OutOfMemoryError

时间:2013-12-13 20:14:06

标签: hadoop mapreduce

您好我有一个mapreduce jar,可以很好地运行小输入文件。当我说小时,我的意思是我用少于10行输入创建的样本输入文件。但是当我尝试在大小为1.8GB的输入文件上运行mapreduce时,我得到了OutOfMemoryError。我不确定我应该做什么。

无论如何我可以限制产生的任务数量吗?并且几乎没有任务运行更长的时间?

在我收到此错误之前,在大输入文件上生成了大约20个任务。这是为前两个任务生成的日志的一部分。

13/12/13 12:00:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
13/12/13 12:00:22 INFO mapreduce.Job: Running job: job_local1170901099_0001
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Waiting for map tasks
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000000_0
13/12/13 12:00:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:22 INFO mapred.Task:  Using ResourceCalculatorProcessTree : null
13/12/13 12:00:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:0+134217728
13/12/13 12:00:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:23 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:23 INFO mapreduce.Job: Job job_local1170901099_0001 running in uber mode : false
13/12/13 12:00:23 INFO mapreduce.Job:  map 0% reduce 0%
13/12/13 12:00:24 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000001_0
13/12/13 12:00:24 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:24 INFO mapred.Task:  Using ResourceCalculatorProcessTree : null
13/12/13 12:00:24 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:134217728+134217728
13/12/13 12:00:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:24 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:25 INFO mapred.MapTask: Starting flush of map output

这是发生错误时生成的日志尾部。

13/12/13 12:00:43 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:43 INFO mapred.Task: Task:attempt_local1170901099_0001_m_000020_0 is done. And is in the process of committing
13/12/13 12:00:43 INFO mapred.LocalJobRunner: map
13/12/13 12:00:43 INFO mapred.Task: Task 'attempt_local1170901099_0001_m_000020_0' done.
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local1170901099_0001_m_000020_0
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Map task executor complete.
13/12/13 12:00:43 WARN mapred.LocalJobRunner: job_local1170901099_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:238)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
13/12/13 12:00:44 INFO mapreduce.Job:  map 100% reduce 0%
13/12/13 12:00:44 INFO mapreduce.Job: Job job_local1170901099_0001 failed with state FAILED due to: NA
13/12/13 12:00:44 INFO mapreduce.Job: Counters: 22
File System Counters
    FILE: Number of bytes read=27635962
    FILE: Number of bytes written=28018656
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=5338170260
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=25
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=1
Map-Reduce Framework
    Map input records=0
    Map output records=0
    Map output bytes=0
    Map output materialized bytes=6
    Input split bytes=122
    Combine input records=0
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=5
    Total committed heap usage (bytes)=530186240
File Input Format Counters 
    Bytes Read=118909386

3 个答案:

答案 0 :(得分:1)

这个答案很晚,但是发布它以防万一。问题是我试图处理的文件已损坏。我得到了不同的文件副本,并在其上运行我的MR工作,一切正常。

答案 1 :(得分:0)

我的第一个冲动是问你的启动参数是什么。通常,当您运行MapReduce并遇到内存不足错误时,您可以使用以下内容作为启动参数:

-Dmapred.map.child.java.opts=-Xmx1G -Dmapred.reduce.child.java.opts=-Xmx1G 

这里的关键是这两个数额是累积的。因此,在启动MapReduce后,您特定添加的数量不应接近超过系统上可用的内存。

答案 2 :(得分:0)

可能会迟到但我通过将以下参数设置为0.2

来解决这个问题

mapred.job.shuffle.input.buffer.percent

这告诉shuffle空间中的reducer JVM只询问0.2%的堆空间,而不是0.7%。你得到的是#34;堆空间"错误,因为shuffle空间正在向JVM询问它不可用的内存。比溢出它只是抛出异常。但是如果你只要求0.2%的几率你将得到内存。一旦你超过分配的内存溢出的逻辑进入了画面。

当然,缺点是失速。

您还可以在运行时计算可用内存量,然后重置缓冲区。