I am looking for a way to split an RDD into two or more RDDs and keep the results as separate RDDs. For example:
rdd_test = sc.parallelize(range(50), 1)
My code:
def split_population_into_parts(rdd_test):
    N = 2
    repartionned_rdd = rdd_test.repartition(N).distinct()
    rdds_for_testab_populations = repartionned_rdd.glom()
    return rdds_for_testab_populations

rdds_for_testab_populations = split_population_into_parts(rdd_test)
Calling collect() on the result gives:
[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48], [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49]]
Now I want to turn each of these lists into its own new RDD, for example RDD1 and RDD2. How can I do that?
Answer 0 (score: 1)
I found a solution:
def get_testab_populations_tables(rdds_for_testab_populations):
    # Each element yielded by toLocalIterator() is one partition's list (from glom()).
    # Bind every list to a new global RDD named tAB_0, tAB_1, ...
    namespace = globals()
    for i, testab_table in enumerate(rdds_for_testab_populations.toLocalIterator()):
        namespace['tAB_%d' % i] = sc.parallelize(testab_table)

After calling get_testab_populations_tables(rdds_for_testab_populations), you can do:
print(tAB_0.collect())
print(tAB_1.collect())
etc.
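
The approach above pulls every partition to the driver and writes names into globals(). As a possible alternative, here is a minimal sketch that returns the per-partition RDDs in a list and keeps the data on the cluster. The helper name split_by_partition is hypothetical (it is not part of Spark), and the sketch assumes it receives the repartitioned RDD before glom() is applied. It relies only on the RDD methods getNumPartitions() and mapPartitionsWithIndex().

```python
def split_by_partition(rdd):
    """Return a list with one RDD per partition of `rdd`.

    Hypothetical helper, not a Spark API. `rdd` is assumed to be the
    repartitioned RDD *before* glom() is applied.
    """

    def keep_only(target_index):
        # Build the per-partition function in its own scope so that each
        # closure captures its own target_index, not the loop variable.
        def select(index, iterator):
            # Pass through the wanted partition, empty out all others.
            return iterator if index == target_index else iter([])
        return select

    return [
        rdd.mapPartitionsWithIndex(keep_only(i))
        for i in range(rdd.getNumPartitions())
    ]
```

With the data from the question, this could be used as rdd1, rdd2 = split_by_partition(rdd_test.repartition(2).distinct()). Each returned RDD still has the original number of partitions, all but one of them empty. If the groups only need to be random rather than aligned with partitions, rdd_test.randomSplit([0.5, 0.5]) also returns a list of RDDs directly.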