Custom partitioner

Date: 2019-06-26 17:58:29

Tags: scala apache-spark partitioning

I have defined the following CustomPartitioner, to which I pass a map of key/value pairs; for a given ID, I want it to return the corresponding value. I have already computed the partitions and put them in the map. The map's values are the partition numbers.

import org.apache.spark.Partitioner
import org.slf4j.LoggerFactory

// Routes each record to the partition number precomputed in accGrpMap.
class CustomPartitioner(partitions: Int, accGrpMap: scala.collection.Map[Int, Int]) extends Partitioner {

  private val LOGGER = LoggerFactory.getLogger(classOf[CustomPartitioner])

  override def numPartitions: Int = partitions

  // The key is an account-group ID as a String; look up its partition number.
  override def getPartition(key: Any): Int = {
    val accGrpId: Int = key.asInstanceOf[String].toInt
    accGrpMap(accGrpId)
  }

  override def equals(other: Any): Boolean = other match {
    case h: CustomPartitioner => h.numPartitions == numPartitions
    case _                    => false
  }
}
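Two failure modes in the `getPartition` lookup are worth noting: a null key (`null.asInstanceOf[String].toInt` throws a NullPointerException) and an ID that is absent from `accGrpMap` (`accGrpMap(accGrpId)` throws a NoSuchElementException). Below is a minimal, pure-Scala sketch (no Spark required) of a defensive version of that lookup; the sample map contents and the fallback partition `0` are assumptions for illustration only:

```scala
// Hypothetical sketch: a defensive version of the lookup done in getPartition.
// The map contents and the fallback partition 0 are illustrative assumptions.
val accGrpMap: Map[Int, Int] = Map(101 -> 0, 202 -> 1)

def safePartition(key: Any): Int = key match {
  case null      => 0 // a null key (e.g. the colIndex == -1 path) would otherwise NPE
  case s: String => accGrpMap.getOrElse(s.trim.toInt, 0) // unknown IDs fall back to 0
  case other     => accGrpMap.getOrElse(other.toString.toInt, 0)
}

println(safePartition("101")) // 0
println(safePartition("202")) // 1
println(safePartition(null))  // 0
println(safePartition("999")) // 0 (ID not in the map, fallback)
```

Whether `0` is an acceptable fallback depends on the job; logging the unexpected key via the existing `LOGGER` before falling back would make such records easier to trace.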

This is how I invoke CustomPartitioner:

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  .map(line => if (colIndex == -1) (null, line) else (line.split(TILDE)(colIndex), line))
  .partitionBy(new CustomPartitioner(partitionCount,partitionMap))
  .map { case (_, line) => line }
  .map(line => addEmptyColumns(line, schemaIndexArray))
  .saveAsTextFile(s"$outputPath/$fileDir")
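The key-extraction step in the pipeline above can be traced in plain Scala. In this sketch `TILDE` is assumed to be `"~"` and the sample line is made up; note that when `colIndex == -1` the pipeline emits `(null, line)`, so the partitioner receives a null key:

```scala
// Hypothetical values for illustration; TILDE is assumed to be "~".
val TILDE = "~"
val line = "101~alice~2019-06-26"
val colIndex = 0

// Mirrors the .map(line => (line.split(TILDE)(colIndex), line)) step.
val key = line.split(TILDE)(colIndex)
println(key) // 101
```

The resulting `key` is a String, which is why `getPartition` casts with `asInstanceOf[String]` before calling `toInt`.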

Can someone tell me what is wrong here, and how I should implement this?

0 answers:

No answers yet