Spark OOM issue when writing a DataFrame to HDFS

Date: 2018-07-18 03:48:11

Tags: scala apache-spark dataframe hdfs cloudera

I'm working on this problem with Spark 2.3.

I'm running a job on a Cloudera cluster with 7 nodes: 64 GB of memory and 16 cores each.

The relevant conf: --conf spark.executor.memoryOverhead=5G --executor-memory 30G --num-executors 15 --executor-cores 5

The Spark executors throw this error:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:110)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.StaticInvoke7$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:288)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:589)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Here is the code I'm running:

// table is an RDD of Rows and schema the matching StructType
val table_df = spark.createDataFrame(table,schema)
// replace nulls in string columns with the literal "null"
val table_df_filled = table_df.na.fill("null")
table_df_filled.write.mode("overwrite").csv("path")

I've tried increasing the executor / driver / overhead memory;

I've tried increasing the number of partitions to several times the original (4000, 8000) via the spark.default.parallelism conf;
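
For reference, a minimal sketch of repartitioning explicitly right before the write, instead of relying on the global spark.default.parallelism setting (the partition count here is only an example, not a recommendation):

// Explicitly repartition the DataFrame before the write;
// 8000 is an example value only.
table_df_filled.repartition(8000).write.mode("overwrite").csv("path")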

Regarding the data size: each row (record) has a few metadata columns plus one very large string column, and I'm sure the problem comes from that column, in which I store the complete HTML source of a single web page (I think no single page exceeds 1 GB?). The total data size is about 100 GB.
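
For context, "Requested array size exceeds VM limit" means a single allocation asked for more than the JVM's maximum array length (just under Integer.MAX_VALUE), and the stack trace shows it happening inside String.getBytes; since the UTF-8 encoder can size the target byte array at up to 3 bytes per character, a string of roughly 1 GB of characters can already request an array beyond that limit. A diagnostic sketch for checking the actual string lengths on the RDD before the DataFrame conversion (the column index is hypothetical, and table is assumed to be an RDD of Rows):

// The JVM rejects single arrays larger than roughly Integer.MAX_VALUE elements,
// and UTF8String.fromString materializes each string as one UTF-8 byte array.
// htmlColIdx is a hypothetical index of the big HTML string column.
val htmlColIdx = 3
val maxChars = table
  .map(row => Option(row.getString(htmlColIdx)).map(_.length).getOrElse(0))
  .max()
println(s"Longest HTML string: $maxChars characters (UTF-8 bytes can be up to 3x this)")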

Has anyone run into a similar problem?

Some follow-ups:

  • I've tried printing out the entire RDD, and it prints through fine.
  • Going through the DataFrame, the task fails with the same problem, so I'm guessing the issue is with some DataFrame column size limit?
  • I managed to write the content out directly from the RDD via saveAsTextFile, with no problem (roughly as sketched below).
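
A minimal sketch of that RDD-only path, assuming table is an RDD of Rows and a simple delimiter-joined text format is acceptable (the tab delimiter and the "null" placeholder for missing values are assumptions):

// Bypass the DataFrame/UnsafeRow conversion entirely and write plain text;
// "\t" as the delimiter and "null" for missing values are assumptions.
table
  .map(row => row.toSeq.map(v => Option(v).map(_.toString).getOrElse("null")).mkString("\t"))
  .saveAsTextFile("path")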

1 Answer:

Answer 0 (score: 0):

It turned out that the cause of the problem was that some records hit the JVM array size limit during the RDD-to-DataFrame conversion. Given that, I had two options:

  • Split the problematic string column into multiple columns (to reduce its size); see the sketch after this list.
  • Encode the output format myself and write the data to HDFS directly from the RDD via saveAsTextFile.
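
A minimal sketch of the first option, splitting the HTML string into fixed-size chunk columns before building the DataFrame (the chunk size, the column names, and the assumption that the HTML sits in the last field of each Row are all hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical layout: the HTML string is the last field of each input Row.
// 256M characters per chunk keeps each string's UTF-8 buffer well below the
// JVM array limit; 4 chunks cover pages of up to ~1 GB of characters.
val chunkSize = 256 * 1024 * 1024
val numChunks = 4

val splitRows = table.map { row =>
  val html = Option(row.getString(row.length - 1)).getOrElse("")
  val chunks = (0 until numChunks).map(i => html.slice(i * chunkSize, (i + 1) * chunkSize))
  Row.fromSeq(row.toSeq.dropRight(1) ++ chunks)
}

val splitSchema = StructType(
  schema.fields.dropRight(1) ++
    (0 until numChunks).map(i => StructField(s"html_part_$i", StringType))
)

spark.createDataFrame(splitRows, splitSchema)
  .na.fill("null")
  .write.mode("overwrite").csv("path")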