I am new to Spark. I have a large dataset of elements (an RDD) that I want to split into two partitions of exactly equal size while maintaining the order of the elements. I tried using RangePartitioner like this:
var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))
This does not give a satisfactory result, because it only divides the data roughly, not into exactly equal sizes, while keeping the order of the elements. For example, with 64 elements and a RangePartitioner, it splits them into partitions of 31 and 33 elements.
I need a partitioner that puts exactly the first 32 elements in one half while the other half contains the second set of 32 elements. Could you help me by suggesting how to use a custom partitioner so that I get two equally sized halves, maintaining the order of the elements?
Answer 0 (Score: 23)
A Partitioner works by assigning a key to a partition. You would need prior knowledge of the key distribution, or would have to look at all keys, to make such a partitioner. This is why Spark does not provide one.
In general you do not need such a partitioner. In fact, I cannot come up with a use case where I would need equal-size partitions. What if the number of elements is odd?
Anyway, let us say you have an RDD keyed by sequential Ints, and you know how many elements there are in total. Then you could write a custom Partitioner like this:
import org.apache.spark.Partitioner

class ExactPartitioner[V](partitions: Int, elements: Int) extends Partitioner {

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to go continuously from 0 to elements-1.
    k * partitions / elements
  }
}
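For example, you could key the RDD by position with zipWithIndex and then repartition with it. A minimal, untested usage sketch (the sc, element values and counts below are just placeholders, not part of the original answer):

import org.apache.spark.SparkContext._  // pair-RDD implicits for partitionBy/values on older Spark
import org.apache.spark.rdd.RDD

// hypothetical input: 64 elements in the order we want to keep
val elements: RDD[String] = sc.parallelize((1 to 64).map(i => "element_" + i))
val count = elements.count().toInt

// key every element by its position and split into two exact halves;
// keys 0..31 map to partition 0 and keys 32..63 to partition 1
val halves: RDD[String] = elements
  .zipWithIndex()                          // (value, 0L .. count-1)
  .map { case (v, i) => (i.toInt, v) }     // ExactPartitioner casts keys to Int
  .partitionBy(new ExactPartitioner[String](2, count))
  .values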
Answer 1 (Score: 12)
This answer takes some inspiration from Daniel's, but provides a complete implementation (using the pimp-my-library pattern) with an example for people's copy-and-paste needs :)
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.util.Random

import RDDConversions._

trait RDDWrapper[T] {
  def rdd: RDD[T]
}

// TODO View bounds are deprecated, should use context bounds
// Might need to change ClassManifest for ClassTag in spark 1.0.0
case class RichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
    rdd: RDD[(K, V)]) extends RDDWrapper[(K, V)] {
  // Here we use a single Long to try to ensure the sort is balanced,
  // but for really large datasets we may want to consider
  // using a tuple of many Longs or even a GUID
  def sortByKeyGrouped(numPartitions: Int): RDD[(K, V)] =
    rdd.map(kv => ((kv._1, Random.nextLong()), kv._2)).sortByKey()
      .grouped(numPartitions).map(t => (t._1._1, t._2))
}

case class RichRDD[T: ClassManifest](rdd: RDD[T]) extends RDDWrapper[T] {
  def grouped(size: Int): RDD[T] = {
    // TODO Version where withIndex is cached
    val withIndex = rdd.mapPartitions(_.zipWithIndex)

    // For every input partition, compute the global index at which it starts:
    // take the element count of each partition (its last local index + 1),
    // then a prefix sum. startValues.last is therefore the total element count.
    val startValues =
      withIndex.mapPartitionsWithIndex((i, iter) =>
        Iterator((i, iter.toIterable.last))).collect().toList
        .sortBy(_._1).map(_._2._2.toLong + 1L).scan(0L)(_ + _)

    withIndex.mapPartitionsWithIndex((i, iter) => iter.map {
      case (value, index) => (startValues(i) + index.toLong, value)
    })
    .partitionBy(new Partitioner {
      def numPartitions: Int = size
      def getPartition(key: Any): Int =
        (key.asInstanceOf[Long] * numPartitions.toLong / startValues.last).toInt
    })
    .map(_._2)
  }
}
Then in another file we have:
// TODO modify above to be implicit class, rather than have implicit conversions
import org.apache.spark.rdd.RDD

object RDDConversions {
  implicit def toRichRDD[T: ClassManifest](rdd: RDD[T]): RichRDD[T] =
    new RichRDD[T](rdd)

  implicit def toRichPairRDD[K <% Ordered[K] : ClassManifest, V: ClassManifest](
      rdd: RDD[(K, V)]): RichPairRDD[K, V] = RichPairRDD(rdd)

  implicit def toRDD[T](rdd: RDDWrapper[T]): RDD[T] = rdd.rdd
}
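As the TODO says, on Scala 2.10+ the same thing could be expressed with an implicit class instead of implicit conversions; a rough, equally untested sketch for the plain-RDD wrapper (object and class names here are just placeholders):

import org.apache.spark.rdd.RDD

object RDDConversionsViaImplicitClass {
  // wraps an RDD and forwards to the RichRDD.grouped defined above
  implicit class GroupedRDD[T: ClassManifest](val rdd: RDD[T]) {
    def grouped(size: Int): RDD[T] = RichRDD(rdd).grouped(size)
  }
}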
Then for your use case you just need (assuming it is already sorted):
import RDDConversions._
yourRdd.grouped(2)
Disclaimer: untested, sort of wrote this straight into the SO answer.
Answer 2 (Score: 0)
In newer versions of Spark you can write your own Partitioner and make use of the method zipWithIndex.
The idea is to index your RDD with zipWithIndex and then, inside a custom Partitioner, use the index (here: index modulo the number of partitions) to assign each element to a partition, which distributes the elements evenly. Sample code looks like this:
import org.apache.spark.{Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// define custom Partitioner class
class EqualDistributionPartitioner(numberOfPartitions: Int) extends Partitioner {
  override def numPartitions: Int = numberOfPartitions

  override def getPartition(key: Any): Int = {
    (key.asInstanceOf[Long] % numberOfPartitions).toInt
  }
}

// create test RDD (starting with one partition)
val testDataRaw = Seq(
  ("field1_a", "field2_a"),
  ("field1_b", "field2_b"),
  ("field1_c", "field2_c"),
  ("field1_d", "field2_d"),
  ("field1_e", "field2_e"),
  ("field1_f", "field2_f"),
  ("field1_g", "field2_g"),
  ("field1_h", "field2_h"),
  ("field1_k", "field2_k"),
  ("field1_l", "field2_l"),
  ("field1_m", "field2_m"),
  ("field1_n", "field2_n")
)
val testRdd: RDD[(String, String)] = spark.sparkContext.parallelize(testDataRaw, 1)

// create index and use it as the key
val testRddWithIndex: RDD[(Long, (String, String))] = testRdd.zipWithIndex().map(msg => (msg._2, msg._1))

// use the index for equal distribution

// example with two partitions
println("Example with 2 partitions:")
val equallyDistributedPartitionTwo = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(2))
equallyDistributedPartitionTwo.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))

// example with four partitions
println("\nExample with 4 partitions:")
val equallyDistributedPartitionFour = testRddWithIndex.partitionBy(new EqualDistributionPartitioner(4))
equallyDistributedPartitionFour.foreach(k => println(s"Partition: ${TaskContext.getPartitionId()}, Content: $k"))
where spark is your SparkSession.
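If you want to run this as a self-contained local test, the SparkSession could be created like this, for example (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EqualDistributionPartitionerExample")
  .master("local[*]")  // local mode using all cores; adjust for a cluster
  .getOrCreate()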
As output, you will get:
Example with 2 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 0, Content: (2,(field1_c,field2_c))
Partition: 1, Content: (3,(field1_d,field2_d))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 0, Content: (6,(field1_g,field2_g))
Partition: 1, Content: (7,(field1_h,field2_h))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 0, Content: (10,(field1_m,field2_m))
Partition: 1, Content: (11,(field1_n,field2_n))
Example with 4 partitions:
Partition: 0, Content: (0,(field1_a,field2_a))
Partition: 0, Content: (4,(field1_e,field2_e))
Partition: 0, Content: (8,(field1_k,field2_k))
Partition: 3, Content: (3,(field1_d,field2_d))
Partition: 3, Content: (7,(field1_h,field2_h))
Partition: 3, Content: (11,(field1_n,field2_n))
Partition: 1, Content: (1,(field1_b,field2_b))
Partition: 1, Content: (5,(field1_f,field2_f))
Partition: 1, Content: (9,(field1_l,field2_l))
Partition: 2, Content: (2,(field1_c,field2_c))
Partition: 2, Content: (6,(field1_g,field2_g))
Partition: 2, Content: (10,(field1_m,field2_m))
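Note that, as the output shows, the modulo in getPartition spreads consecutive indices round-robin over the partitions: the partitions are equally sized, but the first half of the elements does not end up together in one partition. If you need contiguous, equally sized blocks as the question asks, a hypothetical (untested) variant of the partitioner could map index ranges to partitions instead, similar to the ExactPartitioner in answer 0:

// hypothetical variant: contiguous, equally sized blocks
// (assumes the total number of elements is known up front)
class ContiguousBlockPartitioner(numberOfPartitions: Int, totalElements: Long) extends Partitioner {
  override def numPartitions: Int = numberOfPartitions
  override def getPartition(key: Any): Int =
    ((key.asInstanceOf[Long] * numberOfPartitions) / totalElements).toInt
}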