Splitting an RDD into two RDDs with no duplicate values

Date: 2015-04-07 01:11:10

Tags: scala apache-spark

I have an RDD of pairs as follows:

(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)

I want to split it into two RDDs, such that the first contains only pairs whose keys and values are not repeated anywhere else, and the second contains the remaining, omitted pairs.

So from the above we would get these two RDDs:

 1) (105,918)
    (502,516)

 2) (105,757) - Omitted as 105 is already included in 1st RDD
    (105,137) - Omitted as 105 is already included in 1st RDD
    (516,816) - Omitted as 516 is already included in 1st RDD
    (350,502) - Omitted as 502 is already included in 1st RDD
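To pin down the intended rule: scan the pairs in order and keep a pair only if neither of its two numbers has been seen yet. A minimal local-Scala sketch of that greedy rule (plain collections, no Spark; greedySplit is a made-up helper name, not from the post):

def greedySplit(pairs: Seq[(Int, Int)]): (Seq[(Int, Int)], Seq[(Int, Int)]) = {
  val seen = collection.mutable.Set.empty[Int]
  pairs.partition { case (k, v) =>
    // keep the pair only if both numbers are still unused
    if (!seen(k) && !seen(v)) { seen += k; seen += v; true }
    else false
  }
}

greedySplit(Seq((105,918), (105,757), (502,516), (105,137), (516,816), (350,502)))
// -> (List((105,918), (502,516)),
//     List((105,757), (105,137), (516,816), (350,502)))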

Currently I am using a mutable Set to track the elements already selected, after coalescing the original RDD down to a single partition:

val evalCombinations = collection.mutable.Set.empty[Int]
val currentValidCombinations = allCombinations
  .filter { p =>
    if (!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
      evalCombinations += p._1
      evalCombinations += p._2
      true
    } else
      false
  }

This approach is limited by the memory of the single executor the filter runs on, since the mutable Set lives inside one task's closure. Is there a better, more scalable solution?

1 Answer:

Answer 0 (score: 2)

Here is a version that will require more memory on the driver.

import org.apache.spark.rdd._
import org.apache.spark._

def getUniq(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {

    val keys   = rdd.keys.distinct
    val values = rdd.values.distinct

    // the numbers that appear both as a key and as a value
    val both = keys.intersection(values)

    val bBoth = sc.broadcast(both.collect.toSet)

    // drop the pairs whose value also occurs somewhere as a key
    val uKeys = rdd.filter(x => !bBoth.value.contains(x._2))
               .reduceByKey{ case (v1, v2) => v1 }  // keep one pair per key

    uKeys.map{ case (k, v) => (v, k) }              // swap key and value
         .reduceByKey{ case (v1, v2) => v1 }        // keep one pair per value
         .map{ case (k, v) => (v, k) }              // swap back

}
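Applied to the question's data on its own, getUniq behaves roughly as follows (a sketch; which value survives reduceByKey for a repeated key such as 105 is arbitrary):

val sample = sc.parallelize(Array((105,918), (105,757), (502,516),
                                  (105,137), (516,816), (350,502)))
getUniq(sample, sc).collect
// e.g. Array((105,918), (516,816))
// (502,516) and (350,502) are filtered out because their values
// 516 and 502 also occur as keys; the duplicate-105 pairs collapse
// to a single arbitrary survivor.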

def getPartitionedRDDs(rdd: RDD[(Int, Int)], sc: SparkContext) = {

    val r = getUniq(rdd, sc)
    // everything not picked in the first pass
    val remaining = rdd subtract r
    // numbers already used by the first pass
    val set = r.flatMap { case (k, v) => Array(k, v) }.collect.toSet
    // pairs that share no number with the first pass; they may still
    // contain usable pairs, so run getUniq once more over them
    val a = remaining.filter{ case (x, y) => !set.contains(x) &&
                                             !set.contains(y) }
    val b = getUniq(a, sc)
    val part1 = r union b
    val part2 = rdd subtract part1
    (part1, part2)
}
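Note that the second getUniq pass picks up pairs such as (350,502), which only become conflict-free once the first pass has discarded (502,516). Inputs with longer conflict chains (e.g. (1,2),(2,3),(3,4),(4,5),(5,6)) would need more passes, so a fully general version loops until nothing new is found. A rough sketch of that iterative variant, reusing getUniq (getUniqAll is my naming, not from the answer):

def getUniqAll(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {
  var part1 = getUniq(rdd, sc)
  var candidates = rdd subtract part1
  var done = false
  while (!done) {
    // numbers used so far, shipped to the executors
    val used = sc.broadcast(
      part1.flatMap { case (k, v) => Array(k, v) }.collect.toSet)
    val free = candidates.filter { case (x, y) =>
      !used.value.contains(x) && !used.value.contains(y) }
    val next = getUniq(free, sc)
    if (next.count == 0) done = true  // exhausted, or only cycles remain
    else {
      part1 = part1 union next
      candidates = candidates subtract next
    }
  }
  part1
}

The second RDD is then rdd subtract getUniqAll(rdd, sc). In a real job the reassigned RDDs should be cached, since each iteration extends their lineage.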

val rdd = sc.parallelize(Array((105,918),(105,757),(502,516),
                               (105,137),(516,816),(350,502)))

val (first, second) = getPartitionedRDDs(rdd, sc)
// first.collect:  ((516,816), (105,918), (350,502))
// second.collect: ((105,137), (502,516), (105,757))

val rdd1 = sc.parallelize(Array((839,841),(842,843),(840,843),
                                (839,840),(1,2),(1,3),(4,3)))

val (f, s) = getPartitionedRDDs(rdd1, sc)
// f.collect: ((839,841), (1,2), (840,843), (4,3))
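As a quick sanity check on any result (a sketch, not part of the answer): every number in the first RDD should occur exactly once.

val elems = first.flatMap { case (k, v) => Seq(k, v) }
assert(elems.count == elems.distinct.count)  // no number is reused in `first`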