Spark: out of memory error when saving to HDFS

Asked: 2015-04-09 13:44:09

Tags: hadoop apache-spark hdfs

I'm running into an OOME when saving large data to HDFS:
import scala.collection.mutable.ArrayBuffer
import org.apache.commons.lang3.StringUtils // for isBlank; may be org.apache.commons.lang in older builds

// rejected rows are collected on the driver through this accumulable
val accumulableCollection = sc.accumulableCollection(ArrayBuffer[String]())
val rdd = textfile.filter(row => {
    if (row.endsWith(",")) {
        accumulableCollection += row
        false
    } else if (row.length < 100) {
        accumulableCollection += row
        false
    } else {
        true
    }
})
rdd.cache()
val rdd2 = rdd.map(_.split(","))
// fieldsMap is defined elsewhere in the job; its keys are column indexes into row
val rdd3 = rdd2.filter(row => {
    var valid = true
    for ((k, v) <- fieldsMap if valid) {
        if (StringUtils.isBlank(row(k)) || "NULL".equalsIgnoreCase(row(k))) {
            accumulableCollection += row.mkString(",")
            valid = false
        }
    }
    valid
})
sc.parallelize(accumulableCollection.value).saveAsTextFile(hdfsPath)

I run it with spark-submit like this:

--num-executors 2 --driver-memory 1G --executor-memory 1G --executor-cores 2

Here is the log output:

15/04/12 18:46:49 WARN scheduler.TaskSetManager: Stage 4 contains a task of very large size (37528 KB). The maximum recommended task size is 100 KB.
15/04/12 18:46:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, worker4, PROCESS_LOCAL, 38429279 bytes)
15/04/12 18:46:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, worker3, PROCESS_LOCAL, 38456846 bytes)
15/04/12 18:46:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 4.0 (TID 10, worker4, PROCESS_LOCAL, 38426488 bytes)
15/04/12 18:46:51 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 4.0 (TID 11, worker3, PROCESS_LOCAL, 38445061 bytes)
15/04/12 18:46:51 INFO cluster.YarnClusterScheduler: Cancelling stage 4
15/04/12 18:46:51 INFO cluster.YarnClusterScheduler: Stage 4 was cancelled
15/04/12 18:46:51 INFO scheduler.DAGScheduler: Job 4 failed: saveAsTextFile at WriteToHdfs.scala:87, took 5.713617 s
15/04/12 18:46:51 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Job aborted due to stage failure: Serialized task 8:0 was 38617206 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.)
Exception in thread "Driver" org.apache.spark.SparkException: Job aborted due to stage failure: **Serialized task 8:0 was 30617206 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes)** - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.

Serialized task 8:0 was 30617206 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) --- (1) What is this 30MB serialized task?

Consider using broadcast variables for large values. --- (2) What should be the broadcast variable? rdd2? Or accumulableCollection, since that is what I'm writing to HDFS?

When I increase the frameSize, the error becomes java.lang.OutOfMemoryError: Java heap space, and I have to increase both driver memory and executor memory to 2G for it to work. If accumulableCollection.value.length is 500,000, I need 3G. Is this normal?
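
For reference, the frame-size setting from the error can be raised through spark-submit, roughly like this (the value is in MB; 100 here is only an illustrative number, not the exact value I used):

--conf spark.akka.frameSize=100 --driver-memory 2G --executor-memory 2G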

The file is only 146MB and contains 200,000 rows (that's with 2G of memory). (In HDFS it is split into 2 partitions of about 73MB each.)

2 Answers:

Answer 0 (score: 4)

The central programming abstraction in Spark is the RDD, and you can create them in two ways:

(1) parallelizing an existing collection in your driver program, or
(2) referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Method (1), parallelize(), requires you to have your entire dataset in memory on one machine (Learning Spark, page 26).

Method (2), known as External Datasets, should be used for large files.
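
As a rough sketch of the difference (the HDFS path below is only a placeholder):

// (1) parallelize: the whole collection must already exist in driver memory
val small = sc.parallelize(Seq("a", "b", "c"))

// (2) external dataset: rows are read from HDFS partition by partition and
//     never materialized as a single object on the driver
val large = sc.textFile("hdfs:///path/to/input")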

The following line creates an RDD from the contents of accumulableCollection.value and requires it to fit on a single machine:

sc.parallelize(accumulableCollection.value)

You may also be exceeding memory when you cache the RDD:

rdd.cache()

This means the entire textfile RDD is stored in memory. You most likely don't want to do that. See the Spark documentation for advice on choosing a cache level for your data.
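
For example, if the cached data does not comfortably fit in memory, a level that spills to disk may suit better than the default; a minimal sketch, not a recommendation for this particular job:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
rdd.persist(StorageLevel.MEMORY_AND_DISK)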

Answer 1 (score: 1)

It means pretty much what it says. You are trying to serialize a single object that is very large. You should probably rewrite your code so it doesn't do this.

For example, it's not clear to me why you are trying to update an accumulable collection at all, let alone inside filter, which may even execute more than once. Then you cache the RDD, even though you have already tried to copy it onto the driver? Then you add more values to that local collection, and then turn it back into an RDD again?

Why accumulate into a collection at all? Just operate on RDDs. There is a lot of redundancy here.
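
A minimal sketch of that idea, keeping the rejected rows as an RDD instead of collecting them on the driver (the predicates are simplified from the code in the question):

// classify each row on the executors; nothing is gathered on the driver
val isInvalid = (row: String) => row.endsWith(",") || row.length < 100

val invalidRows = textfile.filter(isInvalid)
val validRows = textfile.filter(row => !isInvalid(row))

// write the rejected rows straight from the cluster
invalidRows.saveAsTextFile(hdfsPath)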