Question

I have a Spark program that computes relationships between users. It receives a dataset of type

RDD[(java.lang.Long, Map[(String, String), Integer])]

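The Long is a timestamp, and the Map holds a score for each pair (tuple) of users. The program should apply some function to the scores and return the following type: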

Map[String, Map[java.lang.Long, java.lang.Double]]

The String is the first String of the tuple, and the Map holds the function's result for each timeslot.

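In my case there are about 2000 users, so the Map I receive is very large (2000^2 entries per timeslot), and each result also depends on the previous timeslot's result.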

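I run the program locally and get GC overhead limit exceeded. I increased the heap to 14 GB with -Xmx14G in the VM arguments (I saw the java process take more than 12 GB of memory), but it did not help.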

Approach implemented so far

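I have tried several directions to reduce memory consumption and ended up with the following idea: since each timestamp depends only on the previous one, I collect each timeslot separately and keep the previous results in the driver. This way each computation touches only part of the data, and hopefully the program will not crash.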

Code:

def calculateScorePerTimeslot(distancesPerTimeslotRDD: RDD[(java.lang.Long, Map[(String, String), Integer])]): Map[String, Map[java.lang.Long, java.lang.Double]] = {
   var distancesPerTimeslotVarRDD = distancesPerTimeslotRDD.groupBy(_._1).sortBy(_._1)
   println("Start collecting all the results - cache the data!!")
   distancesPerTimeslotVarRDD.cache()
   println("Caching all the data has completed!")

   while(!distancesPerTimeslotVarRDD.isEmpty())
   {
     val dataForTimeslot: (java.lang.Long, Iterable[(java.lang.Long, Map[(String, String), Integer])]) = distancesPerTimeslotVarRDD.first()
     println("Retrieved data for timeslot: " + dataForTimeslot._1)

     //Code which is irrelevant for question - logic

     println("Removing timeslot: " + dataForTimeslot._1)
     distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1))
     println("Filtering has completed! - without: " + dataForTimeslot._1)
   }
}

Summary: the idea is basically to pull one timeslot at a time and keep the result in the driver, so that I reduce the size of the data passed through collect.

Why I am writing this

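Unfortunately this did not help, and the program still dies. My question is: does taking the first item of the RDD with first() and then filtering it out have the effect of iterating over the RDD's items? Are there better ideas for this kind of problem (better meaning without adding memory or moving to a real distributed cluster)?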

Answer 1

First, RDD[(java.lang.Long, Map[(String, String), Integer])] uses more memory than RDD[(java.lang.Long, Array[(String, String, Integer)])]. If you can use the latter, you will save some memory.
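
As a rough illustration of that change of layout, here is a minimal sketch that flattens one timeslot's Map into an Array of triples. The names ScoreLayout and toTriples are hypothetical and not part of the original code, and the sketch assumes the lookup-by-key behaviour of the Map is not needed later in the logic.

```scala
object ScoreLayout {
  // Flatten one timeslot's score map into an array of
  // (firstUser, secondUser, score) triples.
  // Hypothetical helper, introduced only for illustration.
  def toTriples(scores: Map[(String, String), Integer]): Array[(String, String, Integer)] =
    scores.iterator.map { case ((a, b), s) => (a, b, s) }.toArray

  // With Spark on the classpath, this could be applied per timeslot, e.g.:
  //   val compactRDD = distancesPerTimeslotRDD.mapValues(ScoreLayout.toTriples)
  // which would give an RDD[(java.lang.Long, Array[(String, String, Integer)])].
}
```

The helper itself needs no Spark dependency, so it can be tried in a plain Scala REPL before wiring it into the job.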

Second, your loop is inefficient in how it caches data. Always call unpersist on an RDD that is no longer needed.

distancesPerTimeslotVarRDD.cache()
var rddSize = distancesPerTimeslotVarRDD.count()
println("Caching all the data has completed!")

while(rddSize > 0) {
  val prevRDD = distancesPerTimeslotVarRDD 

  val dataForTimeslot = distancesPerTimeslotVarRDD.first()
  println("Retrieved data for timeslot: " + dataForTimeslot._1)

  //Code which is irrelevant for question - logic

  println("Removing timeslot: " + dataForTimeslot._1)
  // Cache the new value of distancesPerTimeslotVarRDD
  distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1)).cache()

  // Force calculation so we can throw away previous iteration value.
  rddSize = distancesPerTimeslotVarRDD.count()
  println("Filtering has completed! - without: " + dataForTimeslot._1)
  // Get rid of previously cached RDD
  prevRDD.unpersist(false)
}

Third, you could try the Kryo serializer, although it sometimes makes things worse. You have to configure the serializer and replace cache with persist(StorageLevel.MEMORY_ONLY_SER).
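
A minimal sketch of that third suggestion might look like the following. It assumes Spark is on the classpath. The object name KryoSetup, the helper names, and the appName parameter are hypothetical; only the spark.serializer setting, the KryoSerializer class, and StorageLevel.MEMORY_ONLY_SER come from Spark itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object KryoSetup {
  // Build a SparkConf that switches the serializer to Kryo.
  // appName is a placeholder chosen by the caller.
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // Persist in serialized form, used in place of rdd.cache().
  def persistSerialized[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
}
```

In the loop above, each .cache() call would then become a call to persistSerialized (or a direct persist(StorageLevel.MEMORY_ONLY_SER)). Serialized storage trades extra CPU time for a smaller memory footprint, so whether it helps here would need to be measured.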
