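I would like to split an RDD[(K, V)] into buckets, so that the output type is List[(K, RDD[V])]. My proposal is below. I am not satisfied with it, though, because it requires roughly keysNumber passes over the original RDD. Is there another approach that needs fewer passes over the original RDD? If not, what do you think about caching rest before the recursive call? It would certainly be faster, but will Spark keep the memory footprint minimal thanks to the lineage from the first RDD, or will it store ~keysNumber progressively smaller versions of the original RDD? Thank you.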
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def kindOfGroupByKey[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)], keys: List[K] = List.empty[K]): List[(K, RDD[V])] = {
  // If no keys are supplied, collect the distinct keys (one extra job over rdd).
  val keysIn: List[K] = if (keys.isEmpty) rdd.map(_._1).distinct.collect.toList else keys
  @annotation.tailrec
  def go(rdd2: RDD[(K, V)], keys: List[K], output: List[(K, RDD[V])]): List[(K, RDD[V])] = keys match {
    // No keys left: return the buckets in the original key order.
    case Nil => output.reverse
    case currentKey :: keyxs =>
      val filtered = rdd2.filter(_._1 == currentKey)
      val rest = rdd2.filter(_._1 != currentKey)
      val updatedOutput = (currentKey, filtered.map(_._2)) :: output
      // Supposing rdd is cached, is it a good idea to cache rest as well, or would that create many progressively smaller cached versions of rdd and risk overloading RAM?
      go(rest, keyxs, updatedOutput)
  }
  go(rdd, keysIn, List.empty[(K, RDD[V])])
}
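To make the caching question concrete, here is a minimal sketch of the variant I have in mind, where rest is persisted before each recursive call. The name kindOfGroupByKeyCached and the choice of StorageLevel.MEMORY_ONLY are only assumptions for illustration; the keys are passed in explicitly to keep the sketch short.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.reflect.ClassTag

// Hypothetical variant: same recursion as kindOfGroupByKey, but every `rest` is marked for caching.
def kindOfGroupByKeyCached[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], keys: List[K]): List[(K, RDD[V])] = {
  @annotation.tailrec
  def go(rdd2: RDD[(K, V)], remaining: List[K], output: List[(K, RDD[V])]): List[(K, RDD[V])] =
    remaining match {
      case Nil => output.reverse
      case currentKey :: keyxs =>
        val filtered = rdd2.filter(_._1 == currentKey).map(_._2)
        // persist is lazy: nothing is stored until an action actually evaluates `rest`.
        val rest = rdd2.filter(_._1 != currentKey).persist(StorageLevel.MEMORY_ONLY)
        go(rest, keyxs, (currentKey, filtered) :: output)
    }
  go(rdd, keys, Nil)
}
```

As far as I understand, each persisted rest is a distinct RDD with its own cached blocks, which is exactly why I worry about ending up with ~keysNumber shrinking copies in memory unless the earlier ones are released with unpersist().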