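I have an RDD, rddData: RDD[(String, Iterable[(String, String)])], which should be sorted by key and by the pre-split regions given in splits: Array[Array[Byte]]. Below is the code I use to create the partitions with the repartitionAndSortWithinPartitions method: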
protected abstract class HFilePartitioner extends Partitioner {
  def extractKey(n: Any) = n match {
    case (k: java.lang.String, _) => k
    case (_) =>
  }
}

class datasetPartitioner(splits: Array[Array[Byte]]) extends HFilePartitioner {
  override def getPartition(key: Any): Int = {
    val k = extractKey(key)
    for (i <- 1 until splits.length)
      if (Bytes.compareTo(Bytes.toBytes(k.toString()), splits(i)) < 0) return i - 1
    splits.length - 1
  }
  override def numPartitions: Int = splits.length
}
Calling the repartitionAndSortWithinPartitions method:
val partitionedData = rddData.repartitionAndSortWithinPartitions(new datasetPartitioner(splits))
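Using partitionedData.partitions.length, I can see that 6 partitions were created. I then ran
partitionedData.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 1) { println(x._1) }).iterator).collect
to print which partition holds which keys, but every key shows up only for index == 0; the other partitions contain no data. So although 6 partitions were created, the data is not distributed among them. I want the data to be distributed across all partitions.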
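One possible cause, offered as a sketch under assumptions rather than a confirmed diagnosis: for a pair RDD, Spark passes only the key to getPartition, not the whole (key, value) tuple. A bare String does not match the `(k: String, _)` pattern in extractKey, so it falls through to the empty case and becomes Unit, whose toString is "()". That byte string may sort before splits(1), which would send every record to partition 0. The code below reproduces the lookup logic in plain Scala without Spark or HBase. `PartitionerSketch`, `compareBytes` (a stand-in for HBase's Bytes.compareTo), `partitionFor`, and the split points are all hypothetical and for illustration only.

```scala
object PartitionerSketch {
  // Same shape as extractKey in the question: only a (String, _) tuple matches.
  def extractKey(n: Any): Any = n match {
    case (k: String, _) => k
    case _              => () // a bare String key lands here and becomes Unit
  }

  // Stand-in for HBase's Bytes.compareTo: unsigned lexicographic byte comparison.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int =
    a.zip(b)
      .map { case (x, y) => (x & 0xff) - (y & 0xff) }
      .find(_ != 0)
      .getOrElse(a.length - b.length)

  // Same lookup as the loop in getPartition: the first split (from index 1)
  // that is greater than the key decides the partition.
  def partitionFor(k: Any, splits: Array[Array[Byte]]): Int = {
    val bytes = k.toString.getBytes("UTF-8")
    val j = (1 until splits.length).indexWhere(i => compareBytes(bytes, splits(i)) < 0)
    if (j >= 0) j else splits.length - 1
  }

  // Hypothetical split points, not the ones from the question.
  val splits: Array[Array[Byte]] = Array("", "b", "d", "f").map(_.getBytes("UTF-8"))

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "c", "e", "g")
    // Key routed through extractKey, as in the question.
    println(keys.map(k => partitionFor(extractKey(k), splits))) // prints List(0, 0, 0, 0)
    // Key used directly.
    println(keys.map(k => partitionFor(k, splits)))             // prints List(0, 1, 2, 3)
  }
}
```

If this is what is happening, matching on `case k: String => k` in extractKey, or using the key directly in getPartition, would be one way to get the keys spread over the split ranges.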