Unable to use the repartitionAndSortWithinPartitions method

Time: 2017-08-25 10:08:22

Tags: scala apache-spark hbase hadoop2

I have an RDD rddData: RDD[(String, Iterable[(String, String)])] that is sorted by key, together with region pre-split points splits: Array[Array[Byte]]. Below is the code snippet I use to build the partitioner for the repartitionAndSortWithinPartitions method:

    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.Partitioner

    protected abstract class HFilePartitioner extends Partitioner {
      def extractKey(n: Any) = n match {
        case (k: java.lang.String, _) => k
        case _ => // empty body: evaluates to Unit when the key is not a (String, _) pair
      }
    }

    class datasetPartitioner(splits: Array[Array[Byte]]) extends HFilePartitioner {

      override def getPartition(key: Any): Int = {
        val k = extractKey(key)
        // Scan the split points in order; the first split greater than the key
        // marks the upper bound of the key's region
        for (i <- 1 until splits.length)
          if (Bytes.compareTo(Bytes.toBytes(k.toString()), splits(i)) < 0) return i - 1

        // Key is >= the last split point: assign it to the last partition
        splits.length - 1
      }

      override def numPartitions: Int = splits.length
    }
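
As a quick sanity check, the partitioner can be exercised outside of Spark by calling getPartition directly on a handful of keys. This is a minimal sketch; the keys in sampleKeys are hypothetical and should be replaced with a few real keys taken from rddData:

    // Probe the partitioner in isolation: getPartition receives the raw key,
    // exactly as Spark passes it during the shuffle
    val partitioner = new datasetPartitioner(splits)
    val sampleKeys = Seq("row-0001", "row-5000", "row-9999")  // hypothetical keys
    sampleKeys.foreach { k =>
      println(s"$k -> partition ${partitioner.getPartition(k)}")
    }

If every sample key maps to partition 0, the problem is in getPartition (or extractKey) rather than in repartitionAndSortWithinPartitions itself.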

Calling the repartitionAndSortWithinPartitions method:

val partitionedData = rddData.repartitionAndSortWithinPartitions(new datasetPartitioner(splits))

I can see that 6 partitions were created using partitionedData.partitions.length. I then used partitionedData.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 1) {println(x._1)}).iterator).collect to print which partition contains which keys, but all the keys show up for index == 0 and the other partitions contain no data. So although 6 partitions were created, the data is not being distributed among them. I want the data spread across all partitions.
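
For reference, a cleaner way to dump the key-to-partition mapping is to tag every key with its partition index instead of filtering on a single index inside the closure; a minimal sketch under the same setup:

    // Tag each key with the index of the partition it landed in, then collect
    partitionedData
      .mapPartitionsWithIndex((index, it) => it.map { case (key, _) => (index, key) })
      .collect()
      .foreach { case (index, key) => println(s"partition $index -> $key") }

This prints one line per key and makes it easy to see at a glance how many keys each of the 6 partitions actually received.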

0 Answers:

No answers yet.