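I am using a DStream in my application, and the batch size is always the same. Initially, each batch takes 3 seconds to complete all of its computations. After 50 to 60 batches, the time per batch grows from 3 seconds to 8 seconds, and then to 50 seconds. What causes the processing time to increase?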
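I checked the size of the RDDs and there is no significant change. In addition, I force an action at the end of every batch (rdd.take(1)). I also increased the interval between batches to 14 seconds.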
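For illustration, the setup described here (a fixed batch interval with a forced action at the end of each batch) could look roughly like the sketch below. It is not the application's actual code: the object name `BatchTimingSketch`, the local master, the queue-backed input stream, and the timing printout are all assumptions, and Spark Streaming is assumed to be on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

// Hypothetical driver; names and input data are placeholders.
object BatchTimingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("batch-timing-sketch")
    // 14-second batch interval, as mentioned in the question.
    val ssc = new StreamingContext(conf, Seconds(14))

    // Assumption: a queue-backed stream stands in for the real input source.
    val queue = mutable.Queue[RDD[Int]]()
    queue += ssc.sparkContext.parallelize(1 to 1000)
    val stream = ssc.queueStream(queue)

    stream.foreachRDD { (rdd, time) =>
      val start = System.nanoTime()
      rdd.take(1) // forced action at the end of the batch
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(s"batch $time: take(1) finished in $elapsedMs ms")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(30000L) // run for about two batches
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```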
This is where the execution time explodes:
def denseUpdateAndPoints(denseMicroClusters: RDD[MicroCluster], points: RDD[DBSCANPoint]): (RDD[MicroCluster], RDD[DBSCANPoint]) = {
  val minEps = this.minEps
  // Pair every point with every micro-cluster.
  // (`clusters` in the original snippet; assumed to be the denseMicroClusters parameter.)
  val cartesianProduct: RDD[(DBSCANPoint, MicroCluster)] = points.cartesian(denseMicroClusters)
  val distancesPointMicrocluster = cartesianProduct
    .map(x => (x._1, (x._2.id, x._2.distanceSquared(x._1))))
  // For each point, keep the nearest micro-cluster that lies within minEps.
  val nearestPointTocluster = distancesPointMicrocluster
    .filter(_._2._2 <= minEps.value)
    .reduceByKey((x, y) => if (x._2 < y._2) x else y).cache()
  // Per micro-cluster: sum of x, sum of y, and number of assigned points.
  val valuesToUpdateMicroClusters = nearestPointTocluster
    .map(e => (e._2._1, (e._1.x, e._1.y, 1.toLong)))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3)).cache()
  // Filter out the points that have no nearby micro-cluster.
  val pointsNearMicroClustersSet =
    sc.broadcast(nearestPointTocluster.map(x => x._1.pointId).collect.toSet)
  val pointsWithNoNearMicroClusters = points
    .filter(point => !pointsNearMicroClustersSet.value.contains(point.pointId))
  // Update the micro-clusters.
  // (`update` and `microCluster` in the original snippet; assumed to refer to
  // valuesToUpdateMicroClusters and denseMicroClusters.)
  val micros = valuesToUpdateMicroClusters
    .map(x => MicroCluster(Vectors.dense(x._1, x._2._1, x._2._2, x._2._3)))
  val microClusters = denseMicroClusters.union(micros)
    .map(x => (x.id, x))
    .reduceByKey((p1, p2) =>
      MicroCluster(Vectors.dense(p1.id, p1.totalX + p2.totalX, p1.totalY + p2.totalY, p1.totalPoints + p2.totalPoints)))
    .map(x => x._2)
  // The original returned (denseClusters, outlierPoints), which are not defined
  // in the snippet; assumed to be the two values computed above.
  (microClusters, pointsWithNoNearMicroClusters)
}
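To make the intent of the method above easier to follow, here is a minimal sketch of the same steps on plain Scala collections, with no Spark involved. The case classes `Point` and `Micro`, the centroid-based `distanceSquared`, and the object name `DenseUpdateSketch` are hypothetical stand-ins; the real `DBSCANPoint` and `MicroCluster` definitions are not shown in the post and may differ.

```scala
// Hypothetical stand-ins for DBSCANPoint and MicroCluster (fields are assumptions).
final case class Point(pointId: Long, x: Double, y: Double)
final case class Micro(id: Int, totalX: Double, totalY: Double, totalPoints: Long) {
  // Assumption: distance is measured from the cluster centroid.
  def distanceSquared(p: Point): Double = {
    val dx = totalX / totalPoints - p.x
    val dy = totalY / totalPoints - p.y
    dx * dx + dy * dy
  }
}

object DenseUpdateSketch {
  def denseUpdate(micros: Seq[Micro], points: Seq[Point], minEps: Double): (Seq[Micro], Seq[Point]) = {
    // All (point, cluster) pairs within the threshold: cartesian + map + filter.
    val candidates = for {
      p <- points
      m <- micros
      d = m.distanceSquared(p)
      if d <= minEps
    } yield (p, m.id, d)

    // Nearest cluster id per point: the first reduceByKey.
    val nearest: Map[Long, (Point, Int)] =
      candidates.groupBy(_._1.pointId).map { case (pid, cs) =>
        val best = cs.minBy(_._3)
        pid -> (best._1, best._2)
      }

    // Sum of x, sum of y, and point count per cluster: the second reduceByKey.
    val deltas: Map[Int, (Double, Double, Long)] =
      nearest.values.toSeq.groupBy(_._2).map { case (id, ps) =>
        id -> (ps.map(_._1.x).sum, ps.map(_._1.y).sum, ps.size.toLong)
      }

    // Merge the sums into the existing clusters: union + reduceByKey by id.
    val updated = micros.map { m =>
      deltas.get(m.id) match {
        case Some((sx, sy, n)) =>
          m.copy(totalX = m.totalX + sx, totalY = m.totalY + sy, totalPoints = m.totalPoints + n)
        case None => m
      }
    }

    // Points that were not assigned to any cluster.
    val outliers = points.filterNot(p => nearest.contains(p.pointId))
    (updated, outliers)
  }

  def main(args: Array[String]): Unit = {
    val micros = Seq(Micro(1, 0.0, 0.0, 1L), Micro(2, 10.0, 10.0, 1L))
    val points = Seq(Point(1L, 0.5, 0.0), Point(2L, 9.5, 10.0), Point(3L, 50.0, 50.0))
    val (updated, outliers) = denseUpdate(micros, points, minEps = 1.0)
    updated.foreach(println)
    outliers.foreach(println)
  }
}
```

As in the method above, the squared distance is compared directly against `minEps`, so the threshold is treated here as a squared radius.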