Question

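In Apache Spark,

repartition(n) splits an RDD into exactly n partitions.

But how can a given RDD be partitioned so that every partition (except possibly the last one) holds a specified number of elements? The number of elements in the RDD is unknown, and calling .count() is very expensive.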

    C = sc.parallelize([x for x in range(10)], 2)
    # Let's say internally, C = [[0,1,2,3,4,5], [6,7,8,9]]
    C = someCode(3)

Expected:

    C = [[0,1,2], [3,4,5], [6,7,8], [9]]

Answer 1

This is quite easy in PySpark:

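    C = sc.parallelize([x for x in range(10)], 2)
    # partitionBy works on key-value pairs, so use each element as its own key
    rdd = C.map(lambda x: (x, x))
    # 4 partitions; int(x * 4 / 11) maps keys 0-2 -> 0, 3-5 -> 1, 6-8 -> 2, 9 -> 3
    C_repartitioned = rdd.partitionBy(4, lambda x: int(x * 4 / 11)).map(lambda x: x[0]).glom().collect()
    C_repartitioned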

    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

This is called custom partitioning. More on the topic: http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html

http://baahu.in/spark-custom-partitioner-java-example/
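
To see why the partition function in this answer produces that layout, the bucketing can be mimicked in plain Python without a Spark cluster. This is only a sketch: simulate_partition_by is a hypothetical helper, not a Spark API, and it assumes the partition function's result is taken modulo the number of partitions, which is how PySpark's partitionBy assigns keys. It also compares int(k * 4 / 11) with the simpler k // chunk_size.

```python
import math


def simulate_partition_by(keys, num_partitions, partition_func):
    """Hypothetical stand-in for keyed bucketing: each key lands in
    bucket partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        buckets[partition_func(key) % num_partitions].append(key)
    return buckets


keys = list(range(10))
chunk_size = 3
# Computing the partition count still requires knowing how many elements exist.
num_partitions = math.ceil(len(keys) / chunk_size)

# Partition function used in the answer above.
answer_layout = simulate_partition_by(keys, num_partitions, lambda k: int(k * 4 / 11))
# Floor division by the chunk size gives the same assignment for these keys.
floor_div_layout = simulate_partition_by(keys, num_partitions, lambda k: k // chunk_size)

print(answer_layout)
print(floor_div_layout)
```

Both functions work here only because the elements happen to be the consecutive integers 0 to 9, so each element doubles as its own position. For arbitrary elements a position would have to be attached first (for example with zipWithIndex), and choosing the number of partitions still depends on the total element count, so this does not avoid the counting cost mentioned in the question.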
