Spark Standalone java.lang.OutOfMemoryError

Date: 2017-07-12 07:41:56

Tags: scala apache-spark

I have a Spark Standalone application running on a server with 64GB of RAM. The amount of data I need to process far exceeds what the machine can comfortably handle.

I read a large amount of data into one big table and then try to self-join it. The (pseudo)code looks like this:

import spark.implicits._

val df = spark.read.parquet("huge_table.parquet")
val df2 = df.select(...).withColumn(...) // some data manipulations
// self-join on store_name and city_id
df.as("df1").join(df2.as("df2"), $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")

My executor settings are as follows: --driver-memory 8g --executor-memory 32g
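
For context, the full launch command would look roughly like this; the master URL, main class, and jar name below are placeholders, not from the original post:

spark-submit \
  --master spark://master-host:7077 \
  --driver-memory 8g \
  --executor-memory 32g \
  --class com.example.SelfJoinJob \
  self-join-job.jar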

spark-defaults.conf

spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

The problem is that no matter what I do, I get:

Caused by: java.lang.OutOfMemoryError: Java heap space
  at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:53)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:472)
  at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:142)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.<init>(WindowExec.scala:310)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:289)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)

I tried persisting the table before the data manipulations with df.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER).
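
A minimal sketch of where that persist call sat relative to the join; the original select(...)/withColumn(...) bodies are elided in the post, so an identity transformation stands in for them here:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Cache the source table serialized in memory, spilling partitions
// to disk when the executors run out of storage memory.
val df = spark.read.parquet("huge_table.parquet")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

val df2 = df // stand-in for the elided select(...).withColumn(...) manipulations

val joined = df.as("df1").join(df2.as("df2"),
  $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")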

I know I don't have enough resources to process the data efficiently, but is there anything else I can do to get the job to finish, no matter how long it takes?

UPDATE

DF

+----+--------------+---------+
| ID |  store_name  | city_id |
+----+--------------+---------+
|  1 | Apple ...    |      22 |
|  2 | Apple ...    |      33 |
|  3 | Apple ...    |      44 |
+----+--------------+---------+

DF2

+----+--------------+---------+---------+-------------+
| ID |  store_name  | city_id | sale_id | sale_amount |
+----+--------------+---------+---------+-------------+
|  1 | Apple ...    |      33 |       1 | $30         |
|  2 | Apple ...    |      44 |       2 | $50         |
|  3 | Apple ...    |      44 |       3 | $50         |
|  4 | Apple ...    |      44 |       4 | $50         |
|  5 | Apple ...    |      44 |       5 | $40         |
+----+--------------+---------+---------+-------------+
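
For reference, a tiny self-contained sketch that reproduces the join shape with the sample rows above. The store names are truncated in the post ("Apple ..."), so a plain placeholder string is used, and the $ amounts are written as integers:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// DF: one row per store/city; "Apple" is a placeholder for the truncated names.
val df = Seq(
  (1, "Apple", 22),
  (2, "Apple", 33),
  (3, "Apple", 44)
).toDF("ID", "store_name", "city_id")

// DF2: one row per sale; dollar amounts written as plain integers.
val df2 = Seq(
  (1, "Apple", 33, 1, 30),
  (2, "Apple", 44, 2, 50),
  (3, "Apple", 44, 3, 50),
  (4, "Apple", 44, 4, 50),
  (5, "Apple", 44, 5, 40)
).toDF("ID", "store_name", "city_id", "sale_id", "sale_amount")

// Same join condition as above: every DF row fans out to all matching DF2
// rows, so the (Apple, 44) row alone produces four output rows.
df.as("df1").join(df2.as("df2"),
  $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id"
).show()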

0 Answers:

No answers yet.