Comparing data in two RDDs

Time: 2018-12-05 16:45:38

Tags: scala apache-spark pyspark apache-spark-sql rdd

rdd1

(m1,p1)
(m1,p2)
(m1,p3)
(m2,p1)
(m2,p2)
(m2,p3)
(m2,p4)

rdd2

(m1,p1)
(m1,p2)
(m1,p3)
(m2,p1)
(m2,p2)
(m2,p3)

I have two RDDs, rdd1 and rdd2. I want to compare the two RDDs and print the difference, i.e. (m2,p4), which does not exist in rdd2.

I tried rdd1.substractbykey(rdd2) and rdd1.substract(rdd2) but got no data. Please assist.
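As a side note, the RDD methods are spelled `subtract` and `subtractByKey`, not `substract`, which may be why the attempts above produced nothing. Setting Spark aside, the set difference the question is after can be sketched with plain Scala collections (this mirrors what `RDD.subtract` computes, but is not Spark code):

```scala
// The pairs from the question, as ordinary Scala collections.
val rdd1 = Seq(("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
               ("m2", "p1"), ("m2", "p2"), ("m2", "p3"), ("m2", "p4"))
val rdd2 = Seq(("m1", "p1"), ("m1", "p2"), ("m1", "p3"),
               ("m2", "p1"), ("m2", "p2"), ("m2", "p3"))

// Set difference: elements of rdd1 absent from rdd2 -- the semantics of RDD.subtract.
val diff = rdd1.toSet -- rdd2.toSet
println(diff) // Set((m2,p4))
```

The same pairs plugged into `rdd1.subtract(rdd2)` on real RDDs should yield `(m2,p4)` as well, as the last answer below demonstrates.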

3 Answers:

Answer 0 (score: 0)

Try this -

rdd1:

(m1,p1) (m1,p2) (m1,p3) (m2,p1) (m2,p2) (m2,p3) (m2,p4)

rdd2:

(m1,p1) (m1,p2) (m1,p3) (m2,p1) (m2,p2) (m2,p3)

Answer 1 (score: 0)

You can use a full outer join on DataFrames:

from pyspark.sql.functions import col

def find_not_null(row):
    # Keep whichever side of the outer join is non-null.
    if row['col1'] is None:
        return (row['col3'], row['col4'])
    else:
        return (row['col1'], row['col2'])

diff_rdd = rdd1.toDF(['col1', 'col2']) \
    .join(rdd2.toDF(['col3', 'col4']),
          (col('col1') == col('col3')) & (col('col2') == col('col4')),
          'full_outer') \
    .filter(col('col1').isNull() | col('col3').isNull()) \
    .rdd.map(find_not_null)

Answer 2 (score: 0)

If you really need RDDs, you can get the result using subtract and union.

Assuming you are interested in the differences on both sides, this will work:

val left = sc.makeRDD(Seq(("m1","p1"), ("m1","p2"), ("m1","p3"), ("m2","p1"), ("m2","p2"), ("m2","p3"), ("m2","p4")))
val right = sc.makeRDD(Seq(("m1","p1"), ("m1","p2"), ("m1","p3"), ("m2","p1"), ("m2","p2"), ("m2","p3"), ("m3","p1")))

val output = left.subtract(right).union(right.subtract(left))
output.collect() // Array[(String, String)] = Array((m2,p4), (m3,p1))

On the other hand, if you don't mind holding the full outer join in memory, you can achieve the same thing with cogroup:

val output = left.cogroup(right).flatMap { case (k, (i1, i2)) => 
  val s1 = i1.toSet
  val s2 = i2.toSet
  val diff = (s1 diff s2) ++ (s2 diff s1)
  diff.toList.map(k -> _)
}
output.collect() // Array[(String, String)] = Array((m2,p4), (m3,p1))