Spark DStream processing time multiplies after a while

Time: 2019-06-23 08:32:37

Tags: apache-spark spark-streaming

I am using DStreams in my application, and the batch size is always the same. Initially, each batch takes 3 seconds to complete all its computations. After 50 to 60 batches, the time per batch grows from 3 seconds to 8 seconds, and later to 50 seconds. What could be causing this increase?

I checked the size of the RDDs and there is no significant change. I also force an action (rdd.take(1)) at the end of each batch, and I increased the interval between batches to 14 seconds.
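Since the RDD sizes are stable, one thing worth checking is whether the RDD lineage grows from batch to batch; recomputing an ever-longer lineage would produce exactly this kind of slowdown. A minimal diagnostic sketch (the `dstream` and `stateRdd` names are hypothetical placeholders for the input stream and the RDD carried over between batches):

    // stateRdd is a hypothetical name for the RDD that is carried over
    // between batches (e.g. the microclusters RDD).
    dstream.foreachRDD { _ =>
      // toDebugString prints the full lineage; if the output gets longer
      // on every batch, each batch is re-running the transformations of
      // all previous batches
      println(stateRdd.toDebugString)
    }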

Here is where the execution time blows up:

def denseUpdateAndPoints(denseMicroClusters: RDD[MicroCluster], points: RDD[DBSCANPoint]): (RDD[MicroCluster], RDD[DBSCANPoint]) = {
  // capture the broadcast locally so the closure does not serialize `this`
  val minEps = this.minEps
  // pair every point with every microcluster and compute the squared distance
  val cartesianProduct: RDD[(DBSCANPoint, MicroCluster)] = points.cartesian(denseMicroClusters)
  val distancesPointMicrocluster = cartesianProduct
    .map(x => (x._1, (x._2.id, x._2.distanceSquared(x._1))))
  // for each point, keep only the nearest microcluster within minEps
  val nearestPointToCluster = distancesPointMicrocluster
    .filter(_._2._2 <= minEps.value)
    .reduceByKey((x, y) => if (x._2 < y._2) x else y).cache()

  // per microcluster: sum of x, sum of y, and number of assigned points
  val valuesToUpdateMicroClusters = nearestPointToCluster
    .map(e => (e._2._1, (e._1.x, e._1.y, 1L)))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3)).cache()

  // filter out the points that have no nearby microcluster
  val pointsNearMicroClustersSet =
    sc.broadcast(nearestPointToCluster.map(x => x._1.pointId).collect().toSet)
  val pointsWithNoNearMicroClusters = points
    .filter(point => !pointsNearMicroClustersSet.value.contains(point.pointId))

  // turn the aggregates into delta microclusters and merge them with the
  // existing ones by id
  val micros = valuesToUpdateMicroClusters
    .map(x => MicroCluster(Vectors.dense(x._1, x._2._1, x._2._2, x._2._3)))
  val microClusters = denseMicroClusters.union(micros)
    .map(x => (x.id, x))
    .reduceByKey((p1, p2) =>
      MicroCluster(Vectors.dense(p1.id, p1.totalX + p2.totalX, p1.totalY + p2.totalY, p1.totalPoints + p2.totalPoints)))
    .map(x => x._2)

  // return the updated microclusters and the outlier points
  (microClusters, pointsWithNoNearMicroClusters)
}
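A likely culprit with this pattern is that the returned microClusters RDD feeds the union in the next batch, so the lineage grows by one union/reduceByKey stage per batch; cache() alone does not truncate lineage, so recomputation and DAG scheduling get more expensive every batch. A hedged sketch of breaking the chain with checkpointing, assuming the result is carried into the next batch (the checkpoint directory and call site are assumptions, not taken from the question):

    // Truncate the lineage of the carried-over RDD every batch (or every
    // few batches): checkpoint() materializes the RDD to reliable storage
    // and drops its dependency on the whole history.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // set once; any reliable path
    val (microClusters, outliers) = denseUpdateAndPoints(denseMicroClusters, points)
    microClusters.cache()
    microClusters.checkpoint()
    microClusters.count() // checkpointing happens at the next action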

0 Answers:
