rdd1:
(m1,p1)
(m1,p2)
(m1,p3)
(m2,p1)
(m2,p2)
(m2,p3)
(m2,p4)
rdd2:
(m1,p1)
(m1,p2)
(m1,p3)
(m2,p1)
(m2,p2)
(m2,p3)
I have two RDDs, rdd1 and rdd2. I want to compare them and print the difference, i.e. (m2,p4), which does not exist in rdd2.
I tried rdd1.subtractByKey(rdd2) and rdd1.subtract(rdd2), but I get no data back. Please help.
Answer 0 (score: 0)
Try this:
rdd1:
(m1,p1) (m1,p2) (m1,p3) (m2,p1) (m2,p2) (m2,p3) (m2,p4)
rdd2:
(m1,p1) (m1,p2) (m1,p3) (m2,p1) (m2,p2) (m2,p3)
Answer 1 (score: 0)
You can use a full outer join on DataFrames:
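from pyspark.sql.functions import col

def find_not_null(row):
    # Return the pair from whichever side of the join is populated.
    if row['col1'] is None:
        return (row['col3'], row['col4'])
    else:
        return (row['col1'], row['col2'])

# After a full outer join on both columns, a row that exists on only one
# side has nulls in the other side's columns.
diff_rdd = rdd1.toDF(['col1', 'col2']) \
    .join(rdd2.toDF(['col3', 'col4']),
          (col('col1') == col('col3')) & (col('col2') == col('col4')),
          'full_outer') \
    .filter(col('col1').isNull() | col('col3').isNull()) \
    .rdd \
    .map(find_not_null)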
Answer 2 (score: 0)
If you really need RDDs, you can get the result with subtract and union.
Assuming you are interested in the differences on both sides, this will work:
val left = sc.makeRDD(Seq(("m1","p1"), ("m1","p2"), ("m1","p3"), ("m2","p1"), ("m2","p2"), ("m2","p3"), ("m2","p4")))
val right = sc.makeRDD(Seq(("m1","p1"), ("m1","p2"), ("m1","p3"), ("m2","p1"), ("m2","p2"), ("m2","p3"), ("m3","p1")))
val output = left.subtract(right).union(right.subtract(left))
output.collect() // Array[(String, String)] = Array((m2,p4), (m3,p1))
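To see why subtractByKey returns nothing for this data while subtract does not, here is a minimal plain-Python model of the two operations. It is only a sketch of the semantics and does not use Spark; the helper names subtract and subtract_by_key are made up for illustration.

```python
# Plain-Python model of the RDD semantics; no Spark required.
left = [("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
        ("m2", "p1"), ("m2", "p2"), ("m2", "p3"), ("m2", "p4")]
right = [("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
         ("m2", "p1"), ("m2", "p2"), ("m2", "p3"), ("m3", "p1")]

def subtract(a, b):
    """Hypothetical model of subtract: keep elements of a that do not appear in b."""
    seen = set(b)
    return [x for x in a if x not in seen]

def subtract_by_key(a, b):
    """Hypothetical model of subtractByKey: keep pairs of a whose key is absent from b."""
    keys = {k for k, _ in b}
    return [(k, v) for k, v in a if k not in keys]

# Differences on both sides, mirroring left.subtract(right).union(right.subtract(left)).
both_sides = subtract(left, right) + subtract(right, left)
# Keys m1 and m2 both occur in right, so every pair of left is dropped.
by_key = subtract_by_key(left, right)

print(both_sides)  # [('m2', 'p4'), ('m3', 'p1')]
print(by_key)      # []
```

subtractByKey compares keys only, so it cannot find (m2,p4) as long as rdd2 contains any pair with key m2. subtract compares whole elements, which is what the question needs.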
On the other hand, if you don't mind holding the "full outer join" in memory, you can use cogroup to achieve the same thing:
val output = left.cogroup(right).flatMap { case (k, (i1, i2)) =>
  val s1 = i1.toSet
  val s2 = i2.toSet
  val diff = (s1 diff s2) ++ (s2 diff s1)
  diff.toList.map(k -> _)
}
output.collect() // Array[(String, String)] = Array((m2,p4), (m3,p1))
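The cogroup approach can be modeled in plain Python as well: group the values by key on each side, then take the symmetric difference per key. This is a sketch of the idea only, not Spark code, and cogroup_diff is a hypothetical helper name.

```python
from collections import defaultdict

def cogroup_diff(a, b):
    """Group values by key on each side, then take the per-key symmetric difference."""
    left_groups = defaultdict(set)
    right_groups = defaultdict(set)
    for k, v in a:
        left_groups[k].add(v)
    for k, v in b:
        right_groups[k].add(v)

    result = []
    # Visit every key seen on either side; sorting gives a stable output order.
    for k in sorted(left_groups.keys() | right_groups.keys()):
        for v in sorted(left_groups[k] ^ right_groups[k]):
            result.append((k, v))
    return result

left = [("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
        ("m2", "p1"), ("m2", "p2"), ("m2", "p3"), ("m2", "p4")]
right = [("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
         ("m2", "p1"), ("m2", "p2"), ("m2", "p3"), ("m3", "p1")]

diff = cogroup_diff(left, right)
print(diff)  # [('m2', 'p4'), ('m3', 'p1')]
```

Because the values pass through sets (as with toSet in the Scala version), duplicate pairs collapse into one, so this variant fits best when each pair is expected to appear at most once per side.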