Question

When I use the Spark MLlib FP-growth algorithm to mine frequent itemsets, I get the following error.

java.lang.OutOfMemoryError: Java heap space
    at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
    at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
    at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:132)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:178)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:177)
    at scala.collection.immutable.List.foreach(List.scala:381)
    .....

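However, my dataset is only 1000 MB and the number of freqItems is only 300, so I don't know why it gives me an OOM error. Repartitioning did not help either.

By the way, executor.memory is 20G and driver.memory is 20G.

Part of the code: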

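  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.fpm.FPGrowth
  import org.apache.spark.rdd.RDD

  val fileInput = args(0)
  val fileOutput = args(1)
  val fileTemp = args(2)
  val sc = new SparkContext(new SparkConf().setAppName("Association Rules"))
  // Read the input with 48 partitions
  val originData = sc.textFile(fileInput + "/D.dat", 48)

  // One transaction per line, items separated by single spaces
  val transactions: RDD[Array[String]] = originData.map(s => s.trim.split(' '))
  val model = new FPGrowth().setMinSupport(0.092).setNumPartitions(48).run(transactions)
  val freqItems = model.freqItemsets.persist()
  // mkString sorts by item content; toString on an Array only yields an identity string
  val AAnswer = freqItems.sortBy(x => x.items.mkString(" "))
  AAnswer.saveAsTextFile(fileOutput + "/D.dat")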

Answer 1

Just add more memory (the defaults are probably not enough for a 1 GB dataset).

First, add the following options to your spark-shell / spark-submit command:

spark-submit --driver-memory 4g --executor-memory 4g

(or 8g, 16g, whatever works for you)

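You can also set the same values in code (note that in client mode spark.driver.memory set this way has no effect, because the driver JVM is already running by then; use --driver-memory on the command line for the driver):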

val conf = new SparkConf()
             .setMaster("local")
             .setAppName("MyApp")
             .set("spark.executor.memory", "4g")
             .set("spark.driver.memory", "4g")
val sc = new SparkContext(conf)
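
To check whether a memory setting actually reached the JVM, one option is to print the maximum heap from inside the driver program. The following is a minimal sketch that uses only the standard JVM `Runtime` API; the object name `HeapCheck` and the method `maxHeapMiB` are hypothetical names chosen for illustration and are not part of Spark.

```scala
object HeapCheck {
  // Maximum heap this JVM may use, in MiB.
  // This reflects -Xmx, which --driver-memory sets for the driver JVM.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit = {
    println(s"Max heap available to this JVM: $maxHeapMiB MiB")
  }
}
```

If the printed value is far below what you asked for, the setting was probably applied too late or to the wrong process.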

Answer 2

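Set the number of partitions stupidly high. I used 2000, and that solved these problems. I ended up with 2000 CSV files as the result, but hell, it worked, and fast. There is obviously a sweet spot, but start high to test for speed and memory problems, and reduce if necessary.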
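
A rough sketch of what this could look like with the code from the question. The names `FPGrowthManyPartitions` and `mine` are hypothetical, and the `coalesce(1)` before saving is my own assumption, not part of this answer: it is one way to avoid ending up with 2000 output files, and it only makes sense when the result is small (a few hundred itemsets, as in the question).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

object FPGrowthManyPartitions {
  // Runs FP-growth with an explicit partition count and formats each
  // frequent itemset as "item1 item2 ...<TAB>frequency".
  def mine(transactions: RDD[Array[String]],
           minSupport: Double,
           numPartitions: Int): RDD[String] = {
    val model = new FPGrowth()
      .setMinSupport(minSupport)
      .setNumPartitions(numPartitions) // start high (e.g. 2000), reduce later
      .run(transactions)
    model.freqItemsets.map(fi => fi.items.sorted.mkString(" ") + "\t" + fi.freq)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FPGrowth partitions sketch"))
    val transactions = sc.textFile(args(0)).map(_.trim.split(' '))
    // Assumption: the result is small, so merging it into one file is cheap.
    mine(transactions, 0.092, 2000).coalesce(1).saveAsTextFile(args(1))
    sc.stop()
  }
}
```

Keeping the partition count as a parameter makes it easy to try a few values and look for the sweet spot mentioned above.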

Spark ML Lib FP-Growth with Out of Memory
