Optimisation problem on large volumes of data: java.lang.OutOfMemoryError: GC overhead limit exceeded

Date: 2019-04-15 12:58:37

Tags: scala apache-spark

I have a code optimisation problem: I am working on very large volumes of data. When I run the code on a small amount of data there is no problem, but as soon as I reach the real data size I get the following errors: java.lang.OutOfMemoryError: GC overhead limit exceeded and java.util.concurrent.TimeoutException: Futures timed out after [300 seconds].

I have already tried to optimise this processing (removing useless operations, ...). I tried running my code in a for loop so that it executes sequentially (roughly the kind of loop sketched below), and I also tried it with a foreach.
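A minimal sketch of the kind of sequential loop I mean (the exact body is approximate; TestModel and datatest are the same as in the code further down):

val clusters = datatest
  .select("yearWeek", "cluster").distinct().collect()
  .map(r => (r.getString(0), r.getString(1)))

// run one cluster/week at a time instead of mapping over the whole list at once
for ((yearWeek, cluster) <- clusters) {
  val result = TestModel(datatest.filter($"cluster" === cluster), yearWeek, cluster, ModelCcr_1, ModelErlang_1)
  // ... consume the result for this cluster before moving on
}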

I run it with "--num-executors 6 --executor-memory 22G --executor-cores 8 --driver-cores 6 --driver-memory 10G".
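For reference, these flags correspond to the following Spark configuration keys; this is only a sketch, the application name is a placeholder and the session in my job is created elsewhere:

import org.apache.spark.sql.SparkSession

// equivalent resource settings expressed through the session builder (sketch; app name is a placeholder)
val sparkSession = SparkSession.builder()
  .appName("TestModelJob")
  .config("spark.executor.instances", "6")
  .config("spark.executor.memory", "22g")
  .config("spark.executor.cores", "8")
  .config("spark.driver.cores", "6")
  .config("spark.driver.memory", "10g")
  .getOrCreate()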

...

Here is the main loop of the code:

datatest
  .select("yearWeek", "cluster").distinct().collect().toList
  .map(cluster => (cluster.getString(0), cluster.getString(1)))
  .map(cluster => {
    TestModel(datatest.filter($"cluster" === cluster._2), cluster._1, cluster._2, ModelCcr_1, ModelErlang_1)
  })

...

Inside the function:

...

def TestModel(Test_Filtered: DataFrame, yearWeek: String, Cluster: String, ModelCcr: List[(String, Array[loadModelWithId.DecisionTreeModelElsa])], ModelErlang: List[(String, Array[loadModelWithId.DecisionTreeModelElsa])] ): DataFrame = {

val sparkContext = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._

val initialSetReduce = List(0.0)
val addToSetReduce = (s: List[Double], v: Double) => if (s.head==0.0) List(v) else v :: s
val mergePartitionSetsReduce = (p1: List[Double], p2: List[Double]) => p1 ++ p2



val ModelCcr_Off = ModelCcr.filter(x => x._1 == Cluster)
val ModelErlang_Off = ModelErlang.filter(x => x._1 == Cluster)


val moyenne_ccr = Test_Filtered.select("ccr").agg(mean("ccr")).map(x => x.getDouble(0)).head()

val moyenne_erlang = Test_Filtered.select("erlang").agg(mean("erlang")).map(x => x.getDouble(0)).head()


val TrainForDeploy = Test_Filtered
  .select(
  col("erlang")
  , col("resource")
  ,col("resourceIndexCluster")
  , col("dayTime")
  , col("nbConnections").cast("Double")
  , col("ccr")
  , col("resourceType")
  , col("yearWeek")
  ).rdd.map(x => (
  (x.getDouble(0), Vectors.dense(x.getLong(2),x.getDouble(3), x.getDouble(4), 1D, 1D)),
  (x.getDouble(5), Vectors.dense(x.getLong(2),x.getDouble(3), x.getDouble(4), 1D, 1D))
  , x.getString(6)
  , x.getString(7)
  ,x.getString(1)
)).persist(StorageLevel.MEMORY_AND_DISK)



val TrainDeployCCR = TrainForDeploy
  .map(data =>
    ModelCcr_Off.map(x => x._2
      .map(tree =>
        // Compute the predictions for ccr
        ((data._2._2, data._2._1, data._3, data._4,data._5), Distrib.noeudPredictPredQuality(tree.decisionTreeModel, data._2._2))
      )))
  .flatMap(x => x.flatten(x=>x))



val TrainDeployErlang = TrainForDeploy
  .map(data =>
    ModelErlang_Off.map(x => x._2
      .map(tree =>
        // Compute the predictions for erlang
        ((data._1._2, data._1._1, data._3, data._4,data._5), Distrib.noeudPredictPredQuality(tree.decisionTreeModel, data._1._2))
      )))
  .flatMap(x => x.flatten(x=>x))


TrainForDeploy.unpersist()


val TrainNodes_erlang = TrainDeployErlang
  .aggregateByKey(initialSetReduce)(addToSetReduce, mergePartitionSetsReduce) 



val TrainNodes_ccr = TrainDeployCCR
  .aggregateByKey(initialSetReduce)(addToSetReduce, mergePartitionSetsReduce)




val quantile_ccr = TrainNodes_ccr.map(x => (x._1, x._2)) // TODO: remove the quantile computation, move it outside the function
  .map(x => (x._1._5, x._1._1(1), x._1._2, x._1._3, x._1._4, x._2))
  .toDF("resource", "dayTime", "ccr", "resourceType", "yearWeek", "ListPredCCR")
  .withColumn("erreurdelamoyenneCCR", erreurDeLaMoyenne(col("ccr"), col("ListPredCCR"), lit(moyenne_ccr)))
  .withColumn("erreurAuCarreCCR", erreurAuCarre(col("ccr"), col("ListPredCCR")))




val quantile_erlang = TrainNodes_erlang.map(x => (x._1, x._2)) // TODO: change the .first
  .map(x => (x._1._5, x._1._1(1), x._1._2, x._1._3, x._1._4, x._2))
  .toDF("resource", "dayTime", "erlang", "resourceType", "yearWeek", "ListPredErlang")
  .withColumn("erreurdelamoyenneErlang", erreurDeLaMoyenne(col("erlang"), col("ListPredErlang"), lit(moyenne_erlang)))
  .withColumn("erreurAuCarreErlang", erreurAuCarre(col("erlang"), col("ListPredErlang")))




val cluster_Data = quantile_ccr.coalesce(1).join(quantile_erlang.coalesce(1), Seq("resource", "dayTime", "yearWeek", "resourceType"), "inner")



cluster_Data

}

...

In practice the code runs fine for 5 weeks of data, and on the 6th week it crashes with either the GC overhead limit error or the TimeoutException.

I suspect the Spark executors may not have enough memory to process all the data?

I have already tried adding some persist or cache calls, but without success.

Is there a way to free the memory after each iteration (something like the sketch below)? Or do you have another way to optimise this code?
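What I have in mind is roughly the following; this is only a sketch, the output path and write mode are placeholders, and whether unpersist actually frees enough memory here is precisely my question:

for ((yearWeek, cluster) <- clusters) {
  val result = TestModel(datatest.filter($"cluster" === cluster), yearWeek, cluster, ModelCcr_1, ModelErlang_1)
  result.persist(StorageLevel.MEMORY_AND_DISK)
  // placeholder output path: materialise the result so its lineage is not recomputed later
  result.write.mode("append").parquet("/tmp/test_model_output")
  // release the cached blocks before starting the next cluster/week
  result.unpersist()
}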

Thanks in advance for your help.

0 Answers:

No answers yet