How to fix PySpark / JDK memory problems?

Date: 2019-07-01 13:40:54

Tags: apache-spark ubuntu pyspark apache-spark-mllib

I seem to be running into memory problems when using PySpark's ML package. I am trying to call ALS.fit on a DataFrame with 40 million rows. Using JDK 11 produces the error:

"java.lang.NoSuchMethodError: sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner" 

It works on 13 million rows, so I suspect this is a memory-cleanup problem.
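For reference, a minimal sketch of the kind of call that fails; the column names are assumptions, since the question does not show the DataFrame schema:

    from pyspark.ml.recommendation import ALS

    # Hypothetical schema (userId, itemId, rating) -- not shown in the question.
    # ratings_df stands in for the ~40M-row DataFrame described above.
    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              coldStartStrategy="drop")
    model = als.fit(ratings_df)  # fails at ~40M rows with the errors quoted here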

I then tried it with Java JDK 8, as suggested in Apache Spark method not found sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner;, but I still hit an error because the heap runs out of memory:

"... java.lang.OutOfMemoryError: Java heap space ..."

Does anyone know how to work around this?

I am using Ubuntu 18.04 LTS, Python 3.6, and PySpark 2.4.2.
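For reference, one way to pin PySpark to JDK 8 is to point JAVA_HOME at it before the JVM is launched; a sketch, where the install path is an assumption based on Ubuntu's openjdk-8-jdk package:

    import os

    # Assumed JDK 8 location for Ubuntu's openjdk-8-jdk package; adjust
    # to the actual install path. Must be set before the JVM starts.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()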

Edit:

This is how I patched together my Spark context configuration:

  • I have 16 GB of RAM

    conf = spark.sparkContext._conf.setAll([
        ("spark.driver.extraJavaOptions", "-Xss800M"),
        ("spark.memory.offHeap.enabled", "true"),
        ("spark.memory.offHeap.size", "4g"),
        ("spark.executor.memory", "4g"),
        ("spark.app.name", "Spark Updated Conf"),
        ("spark.executor.cores", "2"),
        ("spark.cores.max", "2"),
        ("spark.driver.memory", "6g"),
    ])

I am not sure whether this makes sense!
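Note that options like spark.driver.memory are only read when the driver JVM starts, so setting them on an already-running context's _conf generally has no effect. A minimal sketch of applying the same values at session creation instead, before any JVM exists:

    from pyspark.sql import SparkSession

    # Set memory options before the session (and its JVM) is created;
    # the values mirror the configuration attempted above.
    spark = (SparkSession.builder
             .appName("Spark Updated Conf")
             .config("spark.driver.memory", "6g")
             .config("spark.executor.memory", "4g")
             .config("spark.memory.offHeap.enabled", "true")
             .config("spark.memory.offHeap.size", "4g")
             .getOrCreate())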

Below are the first lines of the error output:

[Stage 8:==================================================>   (186 + 12) / 200]19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_196 in memory! (computed 3.6 MB so far)
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_192 in memory! (computed 5.8 MB so far)
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_192 to disk instead.
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_196 to disk instead.
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_197 in memory! (computed 3.7 MB so far)
19/07/02 14:43:29 WARN BlockManager: Persisting block rdd_37_197 to disk instead.
19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_196 in memory! (computed 3.6 MB so far)
[Stage 8:======================================================>(197 + 3) / 200]19/07/02 14:43:29 WARN MemoryStore: Not enough space to cache rdd_37_192 in memory! (computed 5.8 MB so far)
[Stage 9:>                                                        (0 + 10) / 10]19/07/02 14:43:37 WARN BlockManager: Block rdd_40_3 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_40_4 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_40_7 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_3 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_4 could not be removed as it was not found on disk or in memory
19/07/02 14:43:37 WARN BlockManager: Block rdd_41_7 could not be removed as it was not found on disk or in memory
19/07/02 14:43:38 ERROR Executor: Exception in task 7.0 in stage 9.0 (TID 435)
java.lang.OutOfMemoryError: Java heap space
19/07/02 14:43:39 WARN BlockManager: Block rdd_40_5 could not be removed as it was not found on disk or in memory
19/07/02 14:43:38 ERROR Executor: Exception in task 4.0 in stage 9.0 (TID 432)
java.lang.OutOfMemoryError: Java heap space
        at scala.collection.mutable.ArrayBuilder$ofInt.mkArray(ArrayBuilder.scala:327)
[...]

1 Answer:

Answer 0 (score: 0)

Ultimately, you will probably want to enlarge the memory heap via the -Xmx parameter (in Spark terms, spark.driver.memory and spark.executor.memory, which Spark translates into -Xmx for the driver and executor JVMs).

There are several ways to work out how much memory you need: you can simply keep increasing the heap until the job succeeds, or you can start with a very large heap, observe how much of it is actually used, and then trim it to size.
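One way to observe actual usage is to inspect the driver JVM's heap through PySpark's py4j gateway; a sketch (_jvm is an internal handle, so treat this purely as a debugging aid):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Peek at the driver JVM's current heap usage via the py4j gateway.
    rt = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
    used_mb = (rt.totalMemory() - rt.freeMemory()) / 1024 / 1024
    max_mb = rt.maxMemory() / 1024 / 1024
    print(f"driver heap: {used_mb:.0f} MB used of {max_mb:.0f} MB max")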

You can monitor heap usage in several ways, for example:

  • Run your application with options that write a garbage-collection log: -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -verbose:gc -Xloggc:/some_path/gc.log (see the sketch after this list for attaching these flags to a Spark job)
  • Run the application with the command-line option -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail, then query it with the jcmd utility: jcmd <pid> VM.native_memory summary
  • Or use other means, including graphical utilities; search for them when needed.
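For a PySpark job, a sketch of attaching the GC-logging flags from the first bullet via Spark's extraJavaOptions settings (this uses the JDK 8 flag syntax quoted above; JDK 9+ replaced it with -Xlog:gc*, and the log path is illustrative):

    from pyspark.sql import SparkSession

    # JDK 8 GC-logging flags from the bullet above; adjust the log path.
    gc_opts = ("-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails "
               "-Xloggc:/tmp/gc.log")

    spark = (SparkSession.builder
             .config("spark.driver.extraJavaOptions", gc_opts)
             .config("spark.executor.extraJavaOptions", gc_opts)
             .getOrCreate())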