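I have Spark Standalone set up on a server with 64 GB of RAM. The data I want to process is far larger than what fits in memory.
I read a large amount of data into one big table and try to self-join it. The pseudocode looks like this: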
val df = spark.read.parquet("huge_table.parquet")
val df2 = df.select(...).withColumn(...) // some data manipulations
df.as("df1").join(df2.as("df2"), $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")
My executor settings are as follows: --driver-memory 8g --executor-memory 32g
spark-defaults.conf:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
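For reference, the two command-line memory flags above correspond to standard Spark properties, so they could also live in spark-defaults.conf next to the GC options. This is only a sketch of that equivalent form, not something the setup described here is confirmed to use, and values passed on the command line take precedence over the file.

```properties
# Equivalent of --driver-memory 8g --executor-memory 32g
# (assumed placement, shown for illustration only)
spark.driver.memory    8g
spark.executor.memory  32g
```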
The problem is that no matter what I do, I get:
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:53)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:472)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:142)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.<init>(WindowExec.scala:310)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:289)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
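I tried persisting the table before the data manipulations, using df.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
I know I don't have enough resources to process the data efficiently, but what else can I do to get the job to complete, no matter how long it takes?
Update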
DF
+----+--------------+---------+
| ID | store_name | city_id |
+----+--------------+---------+
| 1 | Apple ... | 22 |
| 2 | Apple ... | 33 |
| 3 | Apple ... | 44 |
+----+--------------+---------+
DF2
+----+--------------+---------+---------+-------------+
| ID | store_name | city_id | sale_id | sale_amount |
+----+--------------+---------+---------+-------------+
| 1 | Apple ... | 33 | 1 | $30 |
| 2 | Apple ... | 44 | 2 | $50 |
| 3 | Apple ... | 44 | 3 | $50 |
| 4 | Apple ... | 44 | 4 | $50 |
| 5 | Apple ... | 44 | 5 | $40 |
+----+--------------+---------+---------+-------------+
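To make the shape of the join easier to reproduce, here is a minimal sketch that builds the two sample tables above in memory and applies the same join condition. The object name SelfJoinSample, the helper buildJoin, the local[*] master, and the choice to store sale_amount as a string are all assumptions made for illustration; the real df2 is derived from df through transformations that are not shown, and the store names are kept elided exactly as in the tables.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reproduction of the sample tables; not the real pipeline.
object SelfJoinSample {

  // Builds DF and DF2 from the sample rows and joins them on
  // store_name and city_id, mirroring the join in the question.
  def buildJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(
      (1, "Apple ...", 22),
      (2, "Apple ...", 33),
      (3, "Apple ...", 44)
    ).toDF("ID", "store_name", "city_id")

    // sale_amount is kept as a string because the sample shows "$30" etc.
    val df2 = Seq(
      (1, "Apple ...", 33, 1, "$30"),
      (2, "Apple ...", 44, 2, "$50"),
      (3, "Apple ...", 44, 3, "$50"),
      (4, "Apple ...", 44, 4, "$50"),
      (5, "Apple ...", 44, 5, "$40")
    ).toDF("ID", "store_name", "city_id", "sale_id", "sale_amount")

    // Column equality needs ===; plain == compares the Column objects themselves.
    df.as("df1").join(
      df2.as("df2"),
      $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id"
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("self-join-sample")
      .master("local[*]") // assumed local run, unlike the standalone cluster
      .getOrCreate()

    val joined = buildJoin(spark)
    joined.explain()               // prints the physical plan chosen for the join
    joined.show(truncate = false)  // prints the matched rows

    spark.stop()
  }
}
```

On data this small the plan will likely differ from the one that fails on the full table, so the sketch is mainly useful for checking the join condition and the output columns rather than for reproducing the memory error.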