Question

I'm using Spark 2.2.1 with Scala 2.11.12 to implement a recursive algorithm. My first implementation used RDDs, but it took too long on large inputs. I then wrote a new version using DataFrames, but it is too slow even with very little data, even though each iteration processes less data than the previous one.

I have tried caching variables in different ways (including different persistence levels), checkpointing at different points, and calling the repartition method with different values and in different functions, but nothing works.

The code starts by looking for the minimum distance between the points that make up the matrix (matrix is a DataFrame):

println("Finding minimum:")
val minDistRes = matrix.select(min("dist")).first().getFloat(0)
val clusterRes = matrix.where($"dist" === minDistRes)

println("New minimum:")
clusterRes.show(1)

Then I save the coordinates of the points for later calculations:

val point1 = clusterRes.first().getInt(0)
val point2 = clusterRes.first().getInt(1)
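
As written, the steps above trigger several separate Spark actions (one for the minimum, one for show, and one per first() call). A minimal sketch of fetching the minimum row once and reading all three values from it, assuming the column order is (idW1, idW2, dist) with dist stored as a Float; the object name MinPairSketch, the helper minPair, and the toy data are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object MinPairSketch {
  // Sorts by distance and takes the first row, so both point ids and the
  // minimum distance come from a single action.
  // Assumption: columns are (idW1: Int, idW2: Int, dist: Float) in this order.
  def minPair(matrix: DataFrame): (Int, Int, Float) = {
    val minRow = matrix.orderBy(col("dist")).first()
    (minRow.getInt(0), minRow.getInt(1), minRow.getFloat(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("min-pair-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real distance matrix.
    val matrix = Seq((1, 2, 0.5f), (1, 3, 0.2f), (2, 3, 0.9f))
      .toDF("idW1", "idW2", "dist")

    val (point1, point2, minDistRes) = minPair(matrix)
    println(s"New minimum: ($point1, $point2) at distance $minDistRes")

    spark.stop()
  }
}
```

If several rows share the minimum distance, this picks one of them arbitrarily, as where(...).first() also does.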

Next, I apply several filters whose results are used to build the new points for the next iteration (creating a broadcast variable is necessary to be able to access this data in a later map):

matrix = matrix.where("!(idW1 == " + point1 +" and idW2 ==" + point2 + " )").cache()

val dfPoints1 = matrix.where("idW1 == " + point1 + " or idW2 == " + point1).cache()

val dfPoints2 = matrix.where("idW1 == " + point2 + " or idW2 == " + point2).cache()

val dfPoints2Broadcast = spark.sparkContext.broadcast(dfPoints2)

val dfUnionPoints = dfPoints1.union(dfPoints2).cache()

val matrixSub = matrix.except(dfUnionPoints).cache()

I then calculate the new points and return the matrix that the algorithm will use recursively:

val newPoints = dfPoints1.map { r =>
  val distAux = dfPoints2Broadcast.value
    .where("idW1 == " + r.getInt(0) + " or idW1 == " + r.getInt(1) +
      " or idW2 == " + r.getInt(0) + " or idW2 == " + r.getInt(1))
    .first().getFloat(2)

  (newIndex.toInt, filterDF(r.getInt(0), r.getInt(1), point1, point2), math.min(r.getFloat(2), distAux))
}.asInstanceOf[Dataset[Row]]
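
Spark does not support running DataFrame operations inside a closure that executes on the executors, so broadcasting the DataFrame itself and calling where(...) on it inside map is unlikely to behave as intended. A minimal sketch of one alternative, assuming dfPoints2 is small enough to collect to the driver: broadcast the collected rows as plain tuples and do the lookup in ordinary Scala. The names BroadcastLookupSketch and firstSharedDist, the toy data, and the fallback to the row's own distance when nothing matches are all assumptions; newIndex and filterDF from the question are left out because their definitions are not shown.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLookupSketch {
  // Distance of the first row that shares an id with (a, b), mirroring the
  // where(...).first().getFloat(2) lookup, but on a local array.
  def firstSharedDist(a: Int, b: Int, rows: Array[(Int, Int, Float)]): Option[Float] =
    rows.collectFirst { case (x, y, d) if x == a || x == b || y == a || y == b => d }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-lookup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for dfPoints1 and dfPoints2; columns assumed to be (idW1, idW2, dist).
    val dfPoints1 = Seq((1, 3, 0.4f), (1, 4, 0.7f)).toDF("idW1", "idW2", "dist")
    val dfPoints2 = Seq((2, 3, 0.6f), (2, 4, 0.3f)).toDF("idW1", "idW2", "dist")

    // Collect the small side to the driver and broadcast the rows, not the DataFrame.
    val points2Local = dfPoints2.as[(Int, Int, Float)].collect()
    val points2Bc = spark.sparkContext.broadcast(points2Local)

    val withMinDist = dfPoints1.as[(Int, Int, Float)].map { case (a, b, dist) =>
      // Assumption: when no row matches, keep the row's own distance.
      val distAux = firstSharedDist(a, b, points2Bc.value).getOrElse(dist)
      (a, b, math.min(dist, distAux))
    }.toDF("idW1", "idW2", "dist")

    withMinDist.show()
    spark.stop()
  }
}
```

Because the result is built with toDF, it is already a DataFrame with named columns and needs no asInstanceOf cast before a union.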

matrix = matrixSub.union(newPoints)

Each iteration ends by caching the matrix variable and performing a checkpoint every so often:

matrix.cache()

if (a % 5 == 0)
 matrix.checkpoint()
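
Note that Dataset.checkpoint() returns a new Dataset with a truncated logical plan instead of modifying the one it is called on, so a call whose result is discarded leaves matrix carrying its full lineage. A minimal sketch of reassigning the result inside an iterative loop, assuming a checkpoint directory has been set; the object name CheckpointSketch, the helper iterate, the temporary directory, and the placeholder filter standing in for one iteration are hypothetical:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CheckpointSketch {
  // Runs `iterations` placeholder steps, truncating the plan every 5th step.
  def iterate(start: DataFrame, iterations: Int): DataFrame = {
    var matrix = start
    for (a <- 1 to iterations) {
      // Placeholder for the real work of one iteration.
      matrix = matrix.where(col("dist") >= 0f)
      if (a % 5 == 0) {
        // checkpoint() is eager by default and returns the checkpointed Dataset,
        // so the result has to be assigned back.
        matrix = matrix.checkpoint()
      }
    }
    matrix
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-sketch")
      .getOrCreate()
    import spark.implicits._

    // A checkpoint directory must be set before checkpoint() is called.
    val dir = Files.createTempDirectory("spark-checkpoints").toString
    spark.sparkContext.setCheckpointDir(dir)

    val matrix = Seq((1, 2, 0.5f), (1, 3, 0.2f), (2, 3, 0.9f))
      .toDF("idW1", "idW2", "dist")

    println(iterate(matrix, 10).count())
    spark.stop()
  }
}
```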

Performance issue with Spark DataFrame, each iteration takes longer
