I'm a beginner, and I'm trying to filter down to a final RDD containing only the items that appear in all of the other RDDs.
My code:

a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
b = ['rs3', 'rs7', 'rs10', 'rs4', 'rs6']
c = ['rs10', 'rs13', 'rs20', 'rs16', 'rs1']
d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']
# parallelize/union live on the SparkContext (sc), not the SparkSession
a_rdd = sc.parallelize(a)
b_rdd = sc.parallelize(b)
c_rdd = sc.parallelize(c)
d_rdd = sc.parallelize(d)
rdd = sc.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()
Result: ['rs4', 'rs16', 'rs5', 'rs6', 'rs7', 'rs3']
My expected result is ['rs3', 'rs4'].
Thanks!!!
Answer 0 (score: 1)
When you say you want an RDD containing the items from all the RDDs, do you mean the intersection? If that's the case, you shouldn't use union, and note that the intersection of your 4 RDDs is empty (there is no element that appears in all four of your RDDs).
But if you do need the intersection of your RDDs:
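You can verify that the intersection really is empty without a cluster, using plain Python sets on the question's lists (a quick check, not Spark code):

```python
from functools import reduce

a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
b = ['rs3', 'rs7', 'rs10', 'rs4', 'rs6']
c = ['rs10', 'rs13', 'rs20', 'rs16', 'rs1']
d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']

# chain set intersection, the same way RDD.intersection is chained below
common = reduce(lambda x, y: x & y, (set(lst) for lst in [a, b, c, d]))
print(common)  # set(): no item appears in all four lists
```

Only 'rs3' and 'rs4' survive the first two lists, and neither occurs in c, so the chained intersection collapses to the empty set.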
from functools import reduce  # reduce is not a builtin in Python 3

def intersection(*args):
    # chain RDD.intersection pairwise across all the RDDs
    return reduce(lambda x, y: x.intersection(y), args)
a = ['rs1','rs2','rs3','rs4','rs5']
b = ['rs3','rs7','rs1','rs2','rs6']
c = ['rs10','rs13','rs2','rs16','rs1']
d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']
a_rdd = sc.parallelize(a)
b_rdd = sc.parallelize(b)
c_rdd = sc.parallelize(c)
d_rdd = sc.parallelize(d)
rdd = sc.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()
intersection(a_rdd, b_rdd, c_rdd, d_rdd).collect()
The output is ['rs1', 'rs2'].
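One caveat with chaining intersection() over many RDDs is that each step is a separate shuffle. An alternative sketch (my suggestion, not part of the answer above) is to count, per item, how many distinct RDDs it occurs in and keep the items seen in all of them. The logic can be checked with plain Python:

```python
from collections import Counter

a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
b = ['rs3', 'rs7', 'rs1', 'rs2', 'rs6']
c = ['rs10', 'rs13', 'rs2', 'rs16', 'rs1']
d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']
lists = [a, b, c, d]

# count each list's distinct items once; an item common to all lists
# is counted exactly len(lists) times
counts = Counter(x for lst in lists for x in set(lst))
common = sorted(x for x, n in counts.items() if n == len(lists))
print(common)  # ['rs1', 'rs2']
```

In Spark the same idea would be sc.union([r.distinct() for r in rdds]).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).filter(lambda kv: kv[1] == len(rdds)).keys(), which does the work in a single reduceByKey instead of repeated pairwise intersections.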