My function (test_rdd.cartesian(test_rdd)) returns an RDD of pairs, like this:
[((1, 0), (1, 0)),
((1, 0), (2, 0)),
((1, 0), (3, 0)),
((2, 0), (1, 0)),
((2, 0), (2, 0)),
((2, 0), (3, 0)),
((3, 0), (1, 0)),
((3, 0), (2, 0)),
((3, 0), (3, 0))]
I need to remove the entries whose two elements are equal (e.g. ..., ((1, 0), (1, 0)), ...).
I'm new to RDDs and Spark, so I'm probably missing something very basic.
Can you give me a hint?
Answer 0 (score: 0)
You can try the following Scala code:
val array = Array(((1, 0), (1, 0)),
((1, 0), (2, 0)),
((1, 0), (3, 0)),
((2, 0), (1, 0)),
((2, 0), (2, 0)),
((2, 0), (3, 0)),
((3, 0), (1, 0)),
((3, 0), (2, 0)),
((3, 0), (3, 0)))
val rdd = sc.parallelize(array) // create the RDD
val filteredRDD = rdd.filter(row => row._1 != row._2) // keep pairs whose two elements differ
filteredRDD.collect() // action: materialize the result
Result:
Array[((Int, Int), (Int, Int))] = Array(((1,0),(2,0)), ((1,0),(3,0)), ((2,0),(1,0)), ((2,0),(3,0)), ((3,0),(1,0)), ((3,0),(2,0)))
For PySpark, you can use the following code:
array = [((1, 0), (1, 0)), ((1, 0), (2, 0)), ((1, 0), (3, 0)), ((2, 0), (1, 0)), ((2, 0), (2, 0)), ((2, 0), (3, 0)), ((3, 0), (1, 0)), ((3, 0), (2, 0)), ((3, 0), (3, 0))]
rdd = sc.parallelize(array)  # create the RDD
filteredRDD = rdd.filter(lambda row: row[0] != row[1])  # keep pairs whose two elements differ
filteredRDD.collect()  # action: materialize the result
Result:
[((1, 0), (2, 0)),
((1, 0), (3, 0)),
((2, 0), (1, 0)),
((2, 0), (3, 0)),
((3, 0), (1, 0)),
((3, 0), (2, 0))]
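Note that in the original setup the filter can be chained directly onto the cartesian product, e.g. test_rdd.cartesian(test_rdd).filter(...). Since running Spark requires a live SparkContext, here is a plain-Python sketch of the same filtering logic using itertools.product (the element list is taken from the question; everything else is illustrative, not Spark API):

```python
from itertools import product

# The elements of the hypothetical test_rdd from the question
elements = [(1, 0), (2, 0), (3, 0)]

# product(elements, repeat=2) yields the same pairs as test_rdd.cartesian(test_rdd)
pairs = product(elements, repeat=2)

# Keep only pairs whose two members differ, mirroring
# rdd.filter(lambda row: row[0] != row[1])
filtered = [row for row in pairs if row[0] != row[1]]

print(filtered)
```

This drops the three "diagonal" pairs ((1,0),(1,0)), ((2,0),(2,0)), ((3,0),(3,0)) and leaves the six mixed pairs, matching the Spark results shown above.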