Get the items that appear in all RDDs - PySpark

Asked: 2017-05-16 11:31:11

Tags: python apache-spark pyspark

I'm new to Spark, and I'm trying to build a final RDD containing only the items that appear in all of the other RDDs.

My code:

    a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
    b = ['rs3', 'rs7', 'rs10', 'rs4', 'rs6']
    c = ['rs10', 'rs13', 'rs20', 'rs16', 'rs1']
    d = ['rs2', 'rs4', 'rs5', 'rs13', 'rs3']

    a_rdd = sc.parallelize(a)
    b_rdd = sc.parallelize(b)
    c_rdd = sc.parallelize(c)
    d_rdd = sc.parallelize(d)

    # union() concatenates the RDDs; distinct() then removes duplicates
    rdd = sc.union([a_rdd, b_rdd, c_rdd, d_rdd]).distinct()

Result: ['rs4', 'rs16', 'rs5', 'rs6', 'rs7', 'rs3']

My expected result is ['rs3', 'rs4'].

Thanks!!!

1 Answer:

Answer 0 (score: 1)

When you say you want an RDD containing the items present in all the RDDs, do you mean the intersection? If that's the case, you shouldn't use union. Also note that the intersection of your 4 RDDs is empty: no element appears in every one of them.
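You can verify this with a quick check (a sketch, using the `a_rdd`, `b_rdd`, `c_rdd` defined in the question): the running intersection becomes empty as soon as `c_rdd` is included.

    # Pairwise intersection of the question's original RDDs
    a_rdd.intersection(b_rdd).collect()
    # -> ['rs3', 'rs4']
    a_rdd.intersection(b_rdd).intersection(c_rdd).collect()
    # -> []  (neither 'rs3' nor 'rs4' appears in c)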

But if you do need the intersection of your RDDs:

    from functools import reduce

    def intersection(*rdds):
        # Fold RDD.intersection over all the RDDs passed in
        return reduce(lambda x, y: x.intersection(y), rdds)

    a = ['rs1', 'rs2', 'rs3', 'rs4', 'rs5']
    b = ['rs3', 'rs7', 'rs1', 'rs2', 'rs6']
    c = ['rs10', 'rs13', 'rs2', 'rs16', 'rs1']
    d = ['rs2', 'rs4', 'rs1', 'rs13', 'rs3']

    a_rdd = sc.parallelize(a)
    b_rdd = sc.parallelize(b)
    c_rdd = sc.parallelize(c)
    d_rdd = sc.parallelize(d)

    intersection(a_rdd, b_rdd, c_rdd, d_rdd).collect()

The output is ['rs1', 'rs2'].
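If you'd rather keep the union-based approach from the question, one alternative (a minimal sketch, not part of the original answer) is to count how many RDDs each item appears in and keep only the items seen in all of them:

    # Tag each distinct item per RDD with a 1, sum the tags across RDDs,
    # and keep items whose count equals the number of RDDs.
    rdds = [a_rdd, b_rdd, c_rdd, d_rdd]
    n = len(rdds)  # captured outside the lambda to avoid shipping RDDs in a closure
    counts = sc.union([r.distinct().map(lambda x: (x, 1)) for r in rdds]) \
               .reduceByKey(lambda x, y: x + y)
    counts.filter(lambda kv: kv[1] == n).keys().collect()
    # -> ['rs1', 'rs2']  (order may vary)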