How to define a custom partitioner for Spark RDDs with equally sized partitions, where each partition has the same number of elements?

Asked: 2014-04-17 07:41:10

Tags: scala hadoop apache-spark

I am new to Spark. I have a large dataset of elements [RDD] and I want to split it into exactly two equally sized partitions while maintaining the order of the elements. I tried using RangePartitioner like this:

var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))

This does not give a satisfactory result, because it partitions only roughly equally while maintaining the order of the elements. For example, with 64 elements, the RangePartitioner splits them into 31 and 33 elements.

I need a partitioner such that I get exactly the first 32 elements in one half and the other half contains the second set of 32 elements. Could you help me by suggesting how to use a custom partitioner so that I get two equally sized halves, maintaining the order of the elements?

3 Answers:

Answer 0 (score: 23)

A Partitioner works by assigning keys to partitions. To build such a partitioner you would need to know the key distribution in advance, or look at all the keys. That is why Spark does not provide one for you.

In general, you do not need such a partitioner. In fact, I cannot come up with a use case that requires equally sized partitions. What if the number of elements is odd?

Anyway, let's assume you have an RDD keyed by sequential Ints and you know how many elements there are in total. Then you can write a custom Partitioner like this:

import org.apache.spark.Partitioner

class ExactPartitioner[V](
    partitions: Int,
    elements: Int)
  extends Partitioner {

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to go continuously from 0 to elements-1.
    k * partitions / elements
  }
}
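
A minimal usage sketch (my addition, not part of the original answer; the `data` RDD and its element type are assumptions): key each element by its position with zipWithIndex, partition with ExactPartitioner, and drop the key again. On recent Spark versions the pair-RDD functions needed for partitionBy are available implicitly.

// Hypothetical usage: `data` is an RDD[String] whose elements should be split in order.
val elements = data.count().toInt
val keyed = data.zipWithIndex().map { case (v, i) => (i.toInt, v) }  // sequential Int keys 0..elements-1
val halves = keyed
  .partitionBy(new ExactPartitioner[String](2, elements))
  .map(_._2)  // drop the key; order within each partition is preserved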

Answer 1 (score: 12)

This answer takes some inspiration from Daniel's, but provides a full implementation (using the pimp my library pattern) and an example for people's copy-and-paste needs :)

import scala.util.Random

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._  // pair-RDD functions (sortByKey, partitionBy) in pre-1.3 Spark
import org.apache.spark.rdd.RDD

import RDDConversions._

trait RDDWrapper[T] {
  def rdd: RDD[T]
}

// TODO View bounds are deprecated, should use context bounds
// Might need to change ClassManifest for ClassTag in spark 1.0.0
case class RichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
  rdd: RDD[(K, V)]) extends RDDWrapper[(K, V)] {
  // Here we use a single Long to try to ensure the sort is balanced, 
  // but for really large dataset, we may want to consider
  // using a tuple of many Longs or even a GUID
  def sortByKeyGrouped(numPartitions: Int): RDD[(K, V)] =
    rdd.map(kv => ((kv._1, Random.nextLong()), kv._2)).sortByKey()
    .grouped(numPartitions).map(t => (t._1._1, t._2))
}

case class RichRDD[T: ClassManifest](rdd: RDD[T]) extends RDDWrapper[T] {
  def grouped(size: Int): RDD[T] = {
    // TODO Version where withIndex is cached
    val withIndex = rdd.mapPartitions(_.zipWithIndex)

    val startValues =
      withIndex.mapPartitionsWithIndex((i, iter) => 
        Iterator((i, iter.toIterable.last))).toArray().toList
      .sortBy(_._1).map(_._2._2.toLong).scan(-1L)(_ + _).map(_ + 1L)

    withIndex.mapPartitionsWithIndex((i, iter) => iter.map {
      case (value, index) => (startValues(i) + index.toLong, value)
    })
    .partitionBy(new Partitioner {
      def numPartitions: Int = size
      def getPartition(key: Any): Int = 
        (key.asInstanceOf[Long] * numPartitions.toLong / startValues.last).toInt
    })
    .map(_._2)
  }
}

Then in another file:

// TODO modify above to be implicit class, rather than have implicit conversions
object RDDConversions {
  implicit def toRichRDD[T: ClassManifest](rdd: RDD[T]): RichRDD[T] = 
    new RichRDD[T](rdd)
  implicit def toRichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
    rdd: RDD[(K, V)]): RichPairRDD[K, V] = RichPairRDD(rdd)
  implicit def toRDD[T](rdd: RDDWrapper[T]): RDD[T] = rdd.rdd
}

Then for your use case, all you need is (assuming it is already sorted):

import RDDConversions._

yourRdd.grouped(2)

Disclaimer: untested, written more or less straight into the SO answer.

Answer 2 (score: 0)

In newer versions of Spark, you can write your own Partitioner and make use of the method zipWithIndex.

The idea is to:

  • index your RDD
  • use the index as the key
  • apply a custom Partitioner based on the desired number of partitions

Example code looks like this:

  // imports assumed for this example
  import org.apache.spark.{Partitioner, TaskContext}
  import org.apache.spark.rdd.RDD

  // define custom Partitioner class
  class EqualDistributionPartitioner(numberOfPartitions: Int) extends Partitioner {
    override def numPartitions: Int = numberOfPartitions

    override def getPartition(key: Any): Int = {
      (key.asInstanceOf[Long] % numberOfPartitions).toInt
    }
  }

  // create test RDD (starting with one partition)
  val testDataRaw = Seq(
    ("field1_a", "field2_a"),
    ("field1_b", "field2_b"),
    ("field1_c", "field2_c"),
    ("field1_d", "field2_d"),
    ("field1_e", "field2_e"),
    ("field1_f", "field2_f"),
    ("field1_g", "field2_g"),
    ("field1_h", "field2_h"),
    ("field1_k", "field2_k"),
    ("field1_l", "field2_l"),
    ("field1_m", "field2_m"),
    ("field1_n", "field2_n")
  )
  val testRdd: RDD[(String, String)] = spark.sparkContext.parallelize(testDataRaw, 1)

  // create index
  val testRddWithIndex: RDD[(Long, (String, String))] = testRdd.zipWithIndex().map(msg => (msg._2, msg._1))

  // use index for equally distribution
  // Example with two partitions
  println("Example with 2 partitions:")
  val equallyDistributedPartitionTwo = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(2))
  equallyDistributedPartitionTwo.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))

  println("\nExample with 4 partitions:")
  // Example with four partitions
  val equallyDistributedPartitionFour = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(4))
  equallyDistributedPartitionFour.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))

where spark is your SparkSession.

As output, you will get:

Example with 2 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 0, Content: (2,(field1_c,field2_c))
Partition: 1, Content: (3,(field1_d,field2_d))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 0, Content: (6,(field1_g,field2_g))
Partition: 1, Content: (7,(field1_h,field2_h))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 0, Content: (10,(field1_m,field2_m))
Partition: 1, Content: (11,(field1_n,field2_n))

Example with 4 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 3, Content: (3,(field1_d,field2_d))
Partition: 3, Content: (7,(field1_h,field2_h))
Partition: 3, Content: (11,(field1_n,field2_n))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 2, Content: (2,(field1_c,field2_c))
Partition: 2, Content: (6,(field1_g,field2_g))
Partition: 2, Content: (10,(field1_m,field2_m))
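
If you prefer to check the distribution programmatically instead of reading the log lines above, a small glom-based count per partition works (a sketch, not part of the original answer):

// Hypothetical sanity check: count how many elements ended up in each partition.
equallyDistributedPartitionTwo
  .glom()           // one Array per partition
  .map(_.length)
  .collect()
  .zipWithIndex
  .foreach { case (count, i) => println(s"Partition $i holds $count elements") }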