Hadoop: job runs fine on a smaller dataset but fails on a large dataset

Date: 2012-07-22 16:40:12

Tags: java hadoop mapreduce hadoop-streaming

I have the following situation.

I have a 3-machine cluster with the following configuration:

Master

Usage of /:   91.4% of 74.41GB 
MemTotal:       16557308 kB
MemFree:          723736 kB 

Slave 01

Usage of /:   52.9% of 29.76GB
MemTotal:       16466220 kB 
MemFree:         5320860 kB

Slave 02

Usage of /:   19.0% of 19.84GB
MemTotal:       16466220 kB
MemFree:         6173564 kB

hadoop/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/work/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
</configuration>

hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
</property>

<property>
  <name>mapred.task.timeout</name>
  <value>0</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
</configuration>

hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
</configuration>
  • I have over 2 million XML documents (each about 400 KB in size)
  • Each map task opens one of these XML documents and emits it as JSON
  • The reduce task takes each of these JSONs as a string, applies a transformation, and emits the result
  • Number of map tasks: 100
  • Number of reduce tasks: 1
  • With number of documents = 10,000, the whole job runs fine
  • With number of documents = 278,262, the job fails and I see the various problems below

On the web UI

On slave-01 and slave-02:

java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 255.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

On the master:

java.lang.RuntimeException: java.io.IOException: Spill failed
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:255)
Caused by: java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
    at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:381)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill1.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
    at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:121)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1392)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)

java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Creation of /work/app/hadoop/tmp/mapred/local/userlogs/job_201207220051_0001/attempt_201207220051_0001_m_000004_2 failed.
    at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:102)
    at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)

-------
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Creation of /work/app/hadoop/tmp/mapred/local/userlogs/job_201207220051_0001/attempt_201207220051_0001_m_000004_2.cleanup failed.
    at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:102)
    at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)

When I went to look at the logs on the slaves, this is what I found in hadoop-hduser-datanode-hadoop-01.log:
2012-07-22 09:26:52,795 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5384386931827098009_1010 src: /10.0.0.81:51402 dest: /10.0.0.82:50010
2012-07-22 09:26:52,800 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor. Cause is 
2012-07-22 09:26:52,800 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-5384386931827098009_1010 received exception java.io.IOException: Unexpected problem in creating temporary file for blk_-5384386931827098009_1010.  File /work/app/hadoop/tmp/dfs/data/tmp/blk_-5384386931827098009 should not be present, but is.
2012-07-22 09:26:52,800 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.0.82:50010, storageID=DS-735951984-127.0.1.1-50010-1342943517618, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Unexpected problem in creating temporary file for blk_-5384386931827098009_1010.  File /work/app/hadoop/tmp/dfs/data/tmp/blk_-5384386931827098009 should not be present, but is.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:426)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:404)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.createTmpFile(FSDataset.java:1249)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1138)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:299)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Thread.java:662)

Please help me understand what I need to do to resolve this issue.

2 Answers:

Answer 0 (Score: 14)

Because your job has a reduce phase, your mappers write their output to local disk on the slaves (not to HDFS). More precisely, the mappers don't write to local disk immediately; they buffer their output in memory until it reaches a threshold (see the io.sort.mb configuration setting). This process is called spilling. I believe the problem is that when Hadoop tries to spill to disk, your slaves don't have enough disk space to hold all the data the mappers generate.
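A quick way to check this diagnosis is to watch free space in the map-output directory on each slave while the job is running. This is only a sketch; the path below is an assumption based on hadoop.tmp.dir=/work/app/hadoop/tmp from your core-site.xml (by default, mapred.local.dir lives under ${hadoop.tmp.dir}/mapred/local):

# Run on each slave while the job is in flight.
df -h /work/app/hadoop/tmp/mapred/local

# See how much intermediate (spill) data has accumulated so far.
du -sh /work/app/hadoop/tmp/mapred/local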

You mention that each mapper produces a JSON string. Assuming each of those is around 100 KB (it may well be larger), that comes to 278,262 x 100 KB = ~28 GB, while your two slaves each have only about 15 GB of free space.

I think the easiest option is to compress the mappers' intermediate output, using these two configuration settings:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Since your data is all JSON/text, I think you'll benefit from pretty much any compression codec that Hadoop supports.
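Since this is a streaming job, the same settings can also be passed per job on the command line instead of editing mapred-site.xml. The sketch below is only illustrative: the jar path, input/output paths, and mapper/reducer scripts are placeholders, and note that older 0.20/1.x releases use the property names mapred.compress.map.output and mapred.map.output.compression.codec for the same switches:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.compress.map.output=true \
    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /data/xml-docs \
    -output /data/json-out \
    -mapper my_mapper.py \
    -reducer my_reducer.py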

As an FYI, if you ever grow beyond 2 million documents, you should also consider adding more memory to the master (the NameNode). As a rule of thumb, each file/directory/block takes up about 150 bytes of NameNode memory (roughly 300 MB per million files), but in practice I'd reserve 1 GB per million files.
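If you do need to give the daemons more heap, one place to set it is conf/hadoop-env.sh. This is a hedged sketch; the 2000 MB figure is just an example, not something derived from your setup:

# conf/hadoop-env.sh
# Maximum heap, in MB, for the Hadoop daemons started on this node (default is 1000).
export HADOOP_HEAPSIZE=2000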

Answer 1 (Score: 0)

I ran into the same problem (on Mac OS X) and solved it by setting the following value in mapred-site.xml:

<property>
  <name>mapred.child.ulimit</name>
  <value>unlimited</value>
</property>

Then I stopped the Hadoop services with bin/stop-all.sh, deleted the /usr/local/tmp/ folder, formatted the NameNode with bin/hadoop namenode -format, and started the Hadoop services again with bin/start-all.sh.
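Put together, the sequence looks roughly like this (a sketch of the steps above; note that formatting the NameNode wipes everything stored in HDFS, so only do this if you can re-load your data):

cd $HADOOP_HOME                # assumed install location
bin/stop-all.sh                # stop all Hadoop daemons
rm -rf /usr/local/tmp/*        # clear the old temporary/data directory
bin/hadoop namenode -format    # WARNING: erases all data in HDFS
bin/start-all.sh               # restart the daemons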