Effectively combining two key-value collections in Spark

Time: 2015-07-31 19:15:37

Tags: scala apache-spark

I have the following lists of key-value pairs (hashmap-like, but not exactly, since they live in a Spark context):

val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

I would like to end up with something like this, computed in parallel if that is even possible (I don't know whether it is):

List(1 -> (Some("a"), Some("A")), 2 -> (Some("b"), Some("B")), 3 -> (Some("c"), Some("C")), 4 -> (Some("d"), None), 5 -> (None, Some("E")))

or at least

List(1 -> ("a","A"), 2 -> ("b","B"), 3 -> ("c","C"))

How can I achieve this? As I understand it, there is no efficient way to look up a value by key in these "maps", since they are not real hashmaps.

4 answers:

Answer 0 (score: 2)

You can use the fullOuterJoin function:

val m1: RDD[(Int, String)] = //...
val m2: RDD[(Int, String)] = //...
val j: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)

Depending on your use case, you can pick any variant of join:

val full:  RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
val left:  RDD[(Int, (String, Option[String]))]         = m1.leftOuterJoin(m2)       
val right: RDD[(Int, (Option[String], String))]         = m1.rightOuterJoin(m2)
val join:  RDD[(Int, (String, String))]                 = m1.join(m2)
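Applied to the sample data from the question, fullOuterJoin produces exactly the first desired shape. A minimal sketch, assuming a running SparkContext `sc` (the sortByKey is only there to make the output order deterministic):

```scala
val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

// Keys present in only one RDD get None on the missing side.
val full = m1.fullOuterJoin(m2).sortByKey().collect()
// Array((1,(Some(a),Some(A))), (2,(Some(b),Some(B))),
//       (3,(Some(c),Some(C))), (4,(Some(d),None)), (5,(None,Some(E))))
```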

Answer 1 (score: 1)

A simple inner join will give you the second form (only keys present in both RDDs are kept):

rdd1.join(rdd2) // RDD[(K, (V1, V2))]
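With the question's data, the inner join drops keys 4 and 5, which appear in only one RDD. A sketch, assuming a running SparkContext `sc`:

```scala
val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

// Only keys present in both RDDs survive an inner join.
m1.join(m2).sortByKey().collect()
// Array((1,(a,A)), (2,(b,B)), (3,(c,C)))
```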

Answer 2 (score: 1)

The accepted answer is correct. If you want to understand it in more depth, have a look at this article on Key/Value pairs.

Answer 3 (score: 0)

Alternatively, using union:

rdd1.union(rdd2).groupByKey() // RDD[(K, Iterable[V])]
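Note that this yields a different shape from a join: each key maps to an Iterable of all its values, not to a pair of Options, so you cannot tell which source RDD a value came from. A sketch, assuming a running SparkContext `sc`:

```scala
val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))

// Each key collects all of its values into one Iterable;
// a key missing from one RDD simply contributes fewer values.
val grouped = m1.union(m2).groupByKey() // RDD[(Int, Iterable[String])]
// e.g. key 1 -> Iterable(a, A), key 4 -> Iterable(d), key 5 -> Iterable(E)
```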