I have two pair RDDs that I'm unioning into a third RDD, but the resulting RDD contains duplicate tuples:
rdd3 = {(1,2), (3,4), (1,2)}
I want to remove the duplicate tuples from rdd3, but only when both the key and the value of the tuples are identical.
How can I do that?
Answer 0 (score: 1)
Just call the Spark Scala lib API directly:
def distinct(): RDD[T]
Keep in mind that it is a generic method with a type parameter. If you call it on an RDD of type RDD[(Int, Int)], it will give you the distinct (Int, Int) tuples in your rdd, as-is.
If you want to look inside this method, see the implementation below:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
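The body above can be traced with plain Scala collections (a sketch, no SparkContext needed; `data` here is an assumed stand-in for your rdd3, and `groupBy` locally plays the role of `reduceByKey`):

```scala
// Emulating distinct's implementation on ordinary Scala collections:
// pair each element with null, keep one representative per key, unwrap.
val data = Seq((1, 2), (3, 4), (1, 2))

val distinctLike = data
  .map(x => (x, null))          // map(x => (x, null))
  .groupBy { case (k, _) => k } // local stand-in for reduceByKey
  .map { case (k, _) => k }     // map(_._1)
  .toSeq

// (1, 2) survives only once: both key and value had to match for
// two tuples to collapse into the same group.
```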
Answer 1 (score: 0)
You can use distinct, for example:
val data = sc.parallelize(
  Seq(
    ("Foo","41","US","3"),
    ("Foo","39","UK","1"),
    ("Bar","57","CA","2"),
    ("Bar","72","CA","2"),
    ("Baz","22","US","6"),
    ("Baz","36","US","6"),
    ("Baz","36","US","6")
  )
)
To remove the duplicates:
val distinctData = data.distinct()
distinctData.collect
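As a sanity check, the same dedup logic can be run locally on a plain Scala `Seq` (a sketch without Spark; `Seq.distinct` uses the same whole-tuple equality as `RDD.distinct`):

```scala
// The same sample data as above, as a local collection.
val localData = Seq(
  ("Foo", "41", "US", "3"),
  ("Foo", "39", "UK", "1"),
  ("Bar", "57", "CA", "2"),
  ("Bar", "72", "CA", "2"),
  ("Baz", "22", "US", "6"),
  ("Baz", "36", "US", "6"),
  ("Baz", "36", "US", "6")
)

// Duplicates are dropped only when every field of the tuple matches.
val localDistinct = localData.distinct
// the repeated ("Baz", "36", "US", "6") row is kept only once
```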