Question

I'm using PySpark and I'm looking for a way to randomly split an RDD into n parts of equal size. Given:

RDD = sc.parallelize(range(50))

My code:

from itertools import repeat

def split_population_into_parts(rdd):
    N = 4

    # Give each of the N parts the same weight (1/N)
    weight_part = 1.0 / N
    weights_list = list(repeat(weight_part, N))

    repartitioned_rdds = rdd.randomSplit(weights=weights_list)

    # And just to check what the weights give, print the size of each part
    for part in repartitioned_rdds:
        print(len(part.collect()))


split_population_into_parts(rdd=RDD)

Given weights = [0.25, 0.25, 0.25, 0.25], my code can give, for example (as RDD lengths):

Why doesn't randomSplit respect the weights here? I would expect, for example, 12, 12, 12 and 14, or 12, 12, 13 and 13. What is the most efficient way to do this? Thanks!

Answer 1

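Equal weights do not guarantee an equal number of records. They only guarantee that each object has the same probability of being assigned to any particular subset.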
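To make that concrete, here is a small plain-Python simulation of the idea. It is only a simplified model, not Spark's actual implementation: the function name `simulate_random_split` and the seeds are made up for illustration, and it assumes each record picks its part independently with probability equal to the weight.

```python
import random
from collections import Counter

def simulate_random_split(num_records, weights, seed):
    """Assign each record independently to a part, with probability given by weights.

    Hypothetical helper for illustration; it models the behaviour described
    above and does not call Spark.
    """
    rng = random.Random(seed)
    part_ids = list(range(len(weights)))
    # Each record chooses its part on its own; nothing balances the totals.
    assignments = rng.choices(part_ids, weights=weights, k=num_records)
    counts = Counter(assignments)
    return [counts[i] for i in part_ids]

# Different seeds give different part sizes, even with identical weights.
for seed in range(3):
    print(simulate_random_split(50, [0.25] * 4, seed))
```

The four sizes always add up to 50, but they will generally not all be 12 or 13, because no step in the process ever compares the parts with each other.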

If the number of records is small, you will see fluctuations like the ones here. This is normal behavior.
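A rough way to see how large these fluctuations should be: if each record is assigned independently with probability equal to the weight (the assumption used throughout this sketch), the size of one part follows a binomial distribution. The helper name `part_size_spread` below is hypothetical.

```python
import math

def part_size_spread(num_records, weight):
    """Mean and standard deviation of one part's size under independent assignment."""
    mean = num_records * weight
    # Standard deviation of a binomial(num_records, weight) count
    std = math.sqrt(num_records * weight * (1 - weight))
    return mean, std

# Print: record count, expected part size, std deviation, relative spread
for n in (50, 5000, 500000):
    mean, std = part_size_spread(n, 0.25)
    print(n, mean, round(std, 2), round(std / mean, 4))
```

For 50 records and a weight of 0.25 the expected size is 12.5 with a standard deviation of about 3.06, so part sizes such as 9 or 16 are unremarkable. The relative spread shrinks in proportion to one over the square root of the record count, which is why the split looks much more even on large RDDs.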
