Writing to Parquet / Kafka: Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError

Posted: 2017-09-14 06:12:48

Tags: scala out-of-memory spark-dataframe apache-spark-mllib cloudera-cdh

I am trying to track down a memory issue in my Spark setup, and at this point I cannot draw a conclusion from my analysis about why it happens. I always hit this problem when writing a DataFrame to Parquet or Kafka. My DataFrame has 5000 rows. Its schema is

root

     |-- A: string (nullable = true)
     |-- B: string (nullable = true)
     |-- C: string (nullable = true)
     |-- D: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- E: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- F: double (nullable = true)
     |-- G: array (nullable = true)
     |    |-- element: double (containsNull = true)
     |-- H: integer (nullable = true)
     |-- I: double (nullable = true)
     |-- J: double (nullable = true)
     |-- K: array (nullable = true)
     |    |-- element: double (containsNull = false)

Column G can have cells of up to 16MB. My DataFrame's total size is about 10GB, split into 12 partitions. Before writing, I try to create 48 partitions using repartition(), but the problem appears even when I write without repartitioning. At the time of this exception I have only one cached DataFrame, about 10GB in size. My driver has 19GB of free memory and the 2 executors have 8GB free each. The Spark version is 2.1.0.cloudera1 and the Scala version is 2.11.8.
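For reference, here is a minimal sketch of the write path described above; the input path, output path, and object name are hypothetical placeholders and do not come from the original post:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ParquetWriteExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-write-example").getOrCreate()

        // Stands in for the cached ~10GB, 5000-row DataFrame from the question.
        val df: DataFrame = spark.read.parquet("/tmp/hypothetical_input.parquet").cache()

        // Repartition to 48 before writing, as attempted in the question;
        // the OOM reportedly occurs with or without this step.
        df.repartition(48)
          .write
          .mode("overwrite")
          .parquet("/tmp/hypothetical_output.parquet")

        spark.stop()
      }
    }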

I have the following settings:

spark.driver.memory     35G
spark.executor.memory   25G
spark.executor.instances    2
spark.executor.cores    3
spark.driver.maxResultSize      30g
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.rdd.compress      true
spark.rpc.message.maxSize       2046
spark.yarn.executor.memoryOverhead      4096
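For illustration only (not part of the original post), this is one way the serialization-related settings above could be applied when building the session programmatically. The memory and YARN properties (spark.driver.memory, spark.executor.memory, spark.yarn.executor.memoryOverhead, etc.) generally must be supplied at launch time via spark-defaults.conf or spark-submit --conf, because the driver JVM is already running by the time this code executes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("memory-tuning-example")
      // Serialization and messaging settings mirroring the configuration above.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryoserializer.buffer.max", "1g")
      .config("spark.rdd.compress", "true")
      .config("spark.rpc.message.maxSize", "2046")
      .getOrCreate()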

The exception stack trace is

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:991)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:765)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:764)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:764)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1228)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Any insights?

1 answer:

Answer 0 (score: -1)

We finally found the problem. We were running k-fold logistic regression in Scala on the 5000-row DataFrame with k = 4. After classification we essentially ended up with 4 test-output DataFrames of about 1250 rows each, and each of them was split into at least 200 partitions. So we had more than 800 partitions on 5000 rows of data. The code then went on to repartition this data into 48 partitions, and our system could not handle the shuffle that repartitioning triggered. To fix it, we repartitioned each fold's output DataFrame to a smaller number of partitions (instead of doing it on the combined DataFrame), and that resolved the issue; see the sketch below.
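A minimal sketch of that fix, assuming the 4 fold-output DataFrames are held in a sequence called foldOutputs and shrunk to 12 partitions each; the names and the target partition count are illustrative, not from the original answer:

    import org.apache.spark.sql.DataFrame

    // Shrink each fold's output (200+ partitions each) before combining them,
    // so the combined result never carries 800+ partitions into one large
    // repartition/shuffle.
    def combineFoldOutputs(foldOutputs: Seq[DataFrame]): DataFrame = {
      val shrunk = foldOutputs.map(_.repartition(12))
      shrunk.reduce(_ union _)
    }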