I am reading the Apache Spark source code and I am stuck on the logic of the RangePartitioner's sketch method. Can someone explain what exactly this code is doing?
// spark/core/src/main/scala/org/apache/spark/Partitioner.scala
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  (numItems, sketched)
}
Answer:
sketch samples the values held in the RDD's partitions. That is, it picks and collects a small subset of element values, uniformly at random, from every partition of the RDD.

Note that sketch is just one step of RangePartitioner, which uses the samples to compute range bounds that split the RDD into output partitions of roughly equal size. The other interesting work happens elsewhere in the RangePartitioner code, namely computing the desired per-partition sample size (sampleSizePerPartition).
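For context, this is roughly how the RangePartitioner constructor sizes the sample before calling sketch. This is my own simplified paraphrase of the logic in Partitioner.scala, not the actual Spark code, and the exact constants may differ between Spark versions:

// Simplified view of how RangePartitioner picks sampleSizePerPartition
// (a paraphrase of the Spark source; constants may vary by version).
def sampleSizePerPartition(numOutputPartitions: Int, numInputPartitions: Int): Int = {
  // aim for roughly 20 sampled keys per output partition, capped at 1e6 in total
  val sampleSize = math.min(20.0 * numOutputPartitions, 1e6)
  // assume the input partitions are roughly balanced and over-sample by 3x,
  // so that moderately skewed partitions still contribute enough keys
  math.ceil(3.0 * sampleSize / numInputPartitions).toInt
}

// e.g. 8 output partitions over an RDD with 100 input partitions:
// sampleSize = 160.0, per-partition sample = ceil(3 * 160 / 100) = 5
println(sampleSizePerPartition(8, 100)) // 5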
My comments below walk through the code step by step.
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Int, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  // run the sampling function on every partition
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    // the partition number `idx` and rdd.id are combined into a unique seed
    // for every partition, so that elements are selected independently and
    // in a unique manner for each partition
    val seed = byteswap32(idx ^ (shift << 16))
    // randomly select a sample of up to sampleSizePerPartition elements and
    // count the total number of elements in the partition.
    // What is cool about reservoir sampling is that it does both in a single
    // pass - O(N), where N is the number of elements in the partition
    // (see the minimal example after this listing).
    // See also http://en.wikipedia.org/wiki/Reservoir_sampling
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2.toLong).sum
  // return the total count of elements in the RDD along with the samples
  (numItems, sketched)
}
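To make the single-pass property concrete, here is a minimal, self-contained version of reservoir sampling in the spirit of SamplingUtils.reservoirSampleAndCount. This is my own simplified sketch (Spark's real implementation uses its own XORShiftRandom, among other details):

import scala.reflect.ClassTag
import scala.util.Random

// A minimal reservoir sample-and-count: returns up to k uniformly sampled
// elements plus the exact element count, in one pass over the iterator.
def reservoirSampleAndCount[T : ClassTag](
    input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
  val rng = new Random(seed)
  val reservoir = new Array[T](k)
  var n = 0L
  while (input.hasNext) {
    val item = input.next()
    if (n < k) {
      // fill the reservoir with the first k elements
      reservoir(n.toInt) = item
    } else {
      // replace a random slot with probability k / (n + 1), which keeps
      // every element seen so far equally likely to be in the sample
      val j = (rng.nextDouble() * (n + 1)).toLong
      if (j < k) reservoir(j.toInt) = item
    }
    n += 1
  }
  // if the partition had fewer than k elements, trim the unused slots
  val sample = if (n < k) reservoir.take(n.toInt) else reservoir
  (sample, n)
}

// e.g. sample 5 of 1000 elements in a single pass, also getting the count
val (sample, count) = reservoirSampleAndCount(Iterator.range(0, 1000), 5, seed = 42L)

Because each incoming element replaces a reservoir slot with probability k / (n + 1), every element of the partition ends up in the sample with equal probability, which is exactly the uniform per-partition sample that sketch needs.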