Performance issue with Spark DataFrame: each iteration takes longer

Date: 2018-02-05 12:59:06

Tags: scala apache-spark apache-spark-sql spark-dataframe

I'm using Spark 2.2.1 with Scala 2.11.12 to implement a recursive algorithm. I first tried an implementation based on RDDs, but it took too long on large inputs. I then wrote a new version using DataFrames, but it is slow even with very little data: each iteration takes longer than the previous one, even though each iteration processes less data than the one before.

I have tried caching variables in different ways (including different persistence levels), checkpointing at different points, and calling repartition with different values and in different functions, but nothing works.

The code starts by finding the minimum distance between the points that make up the matrix (matrix is a DataFrame):

println("Finding minimum:")
val minDistRes = matrix.select(min("dist")).first().getFloat(0)
val clusterRes = matrix.where($"dist" === minDistRes)

println(s"New minimum:")
clusterRes.show(1) 

Then I save the coordinates of the two points for later calculations:

val point1 = clusterRes.first().getInt(0)
val point2 = clusterRes.first().getInt(1)
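As a side note, the steps above trigger several separate Spark actions (`first()` for the minimum, `show(1)`, and two more `first()` calls). A minimal sketch of fetching the closest pair in a single action, assuming the columns are ordered `idW1`, `idW2`, `dist` with types Int, Int, Float as the `getInt`/`getFloat` calls suggest (the object and function names are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MinPairSketch {
  // Returns (idW1, idW2, dist) of the row with the smallest distance.
  // Sorting by "dist" and taking the head runs as one Spark action.
  // Assumes column order idW1, idW2, dist and a non-empty matrix.
  def closestPair(matrix: DataFrame): (Int, Int, Float) = {
    val closest = matrix.orderBy("dist").head()
    (closest.getInt(0), closest.getInt(1), closest.getFloat(2))
  }
}
```

With this, `val (point1, point2, minDistRes) = MinPairSketch.closestPair(matrix)` would replace the separate lookups. If several rows share the minimum distance, which one is returned is not defined, just as with `first()` in the original code.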

Next, I apply several filters whose results are used to build the new points for the next iteration (I create a broadcast variable so that this data can be accessed inside a later map):

matrix = matrix.where("!(idW1 == " + point1 +" and idW2 ==" + point2 + " )").cache()

val dfPoints1 = matrix.where("idW1 == " + point1 + " or idW2 == " + point1).cache()

val dfPoints2 = matrix.where("idW1 == " + point2 + " or idW2 == " + point2).cache()

val dfPoints2Broadcast = spark.sparkContext.broadcast(dfPoints2)

val dfUnionPoints = dfPoints1.union(dfPoints2).cache()

val matrixSub = matrix.except(dfUnionPoints).cache()

I then compute the new points and build the matrix that the algorithm will use recursively:

val newPoints = dfPoints1.map { r =>
  val distAux = dfPoints2Broadcast.value
    .where("idW1 == " + r.getInt(0) + " or idW1 == " + r.getInt(1) +
      " or idW2 == " + r.getInt(0) + " or idW2 == " + r.getInt(1))
    .first().getFloat(2)

  (newIndex.toInt, filterDF(r.getInt(0), r.getInt(1), point1, point2), math.min(r.getFloat(2), distAux))
}.asInstanceOf[Dataset[Row]]

matrix = matrixSub.union(newPoints)
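One thing that may be worth checking in this step: a DataFrame is a driver-side handle, so broadcasting it and querying it inside `map` is not a supported pattern in Spark. A hedged sketch of an alternative is to collect the (small) `dfPoints2` into a plain Scala `Map`, broadcast that, and look distances up by the "other" point id. It assumes that every row of `dfPoints1` contains `point1` and every row of `dfPoints2` contains `point2`, that `filterDF` simply returns the id that is neither of the two merged points, and that `newIndex` is an Int; none of this is confirmed by the code shown, and `MergeSketch`/`mergedDistances` are made-up names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MergeSketch {
  // For every point k that has a distance to point1, emit
  // (newIndex, k, min(d(k, point1), d(k, point2))).
  // Assumes column order idW1, idW2, dist with types Int, Int, Float.
  def mergedDistances(spark: SparkSession, dfPoints1: DataFrame, dfPoints2: DataFrame,
                      point1: Int, point2: Int, newIndex: Int): DataFrame = {
    import spark.implicits._

    // Collect distances to point2 on the driver, keyed by the other point id.
    val distToPoint2: Map[Int, Float] = dfPoints2.collect().map { r =>
      val other = if (r.getInt(0) == point2) r.getInt(1) else r.getInt(0)
      other -> r.getFloat(2)
    }.toMap
    val lookup = spark.sparkContext.broadcast(distToPoint2)

    dfPoints1.map { r =>
      val other = if (r.getInt(0) == point1) r.getInt(1) else r.getInt(0)
      val d1 = r.getFloat(2)
      // Fall back to d1 when point2 has no recorded distance to this point.
      (newIndex, other, math.min(d1, lookup.value.getOrElse(other, d1)))
    }.toDF("idW1", "idW2", "dist")
  }
}
```

Using `toDF` with explicit column names also avoids the `asInstanceOf[Dataset[Row]]` cast, which only changes the static type and does not convert the tuples into rows.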

Each iteration ends by caching the matrix variable and, every few iterations, checkpointing it:

matrix.cache()

if (a % 5 == 0)
 matrix.checkpoint()
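A detail that might matter here: `Dataset.checkpoint()` does not modify the Dataset it is called on. It returns a new Dataset with a truncated logical plan, so if the return value is discarded, `matrix` keeps its full and ever-growing lineage. A minimal sketch of reassigning the result, assuming a checkpoint directory has been set beforehand (the helper name and the `every` parameter are made up for illustration):

```scala
import org.apache.spark.sql.DataFrame

object CheckpointSketch {
  // Every `every` iterations, return the checkpointed Dataset (lineage cut);
  // otherwise just mark the matrix as cached and return it.
  // Requires spark.sparkContext.setCheckpointDir(...) to have been called.
  def truncateLineage(matrix: DataFrame, iteration: Int, every: Int = 5): DataFrame =
    if (iteration % every == 0) matrix.checkpoint() else matrix.cache()
}
```

In the loop this would be used as `matrix = CheckpointSketch.truncateLineage(matrix, a)`, after a one-time call such as `spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")` (the path is only an example). Whether the growing plan is really what slows each iteration down is a guess that would need to be confirmed, for instance by comparing `matrix.explain()` output across iterations.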

0 answers:

No answers