I have the following lists of key-value pairs (like hashmaps, but not exactly, in a Spark context):
val m1 = sc.parallelize(List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val m2 = sc.parallelize(List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E"))
I want to get something like the following, computed efficiently in parallel (I don't even know if that's possible):
List(1 -> (Some("a"), Some("A")), 2 -> (Some("b"), Some("B")), 3 -> (Some("c"), Some("C")), 4 -> (Some("d"), None), 5 -> (None, Some("E")))
or at least
List(1 -> ("a","A"), 2 -> ("b","B"), 3 -> ("c","C"))
How can I achieve this? As far as I understand, I have no efficient way to look up a value by key in these "maps" - they aren't real hashmaps.
Answer 0 (score: 2)
You can use the fullOuterJoin function:
val m1: RDD[(Int, String)] = //...
val m2: RDD[(Int, String)] = //...
val j: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
Depending on your use case, you can pick any variation of the join operations:
val full: RDD[(Int, (Option[String], Option[String]))] = m1.fullOuterJoin(m2)
val left: RDD[(Int, (String, Option[String]))] = m1.leftOuterJoin(m2)
val right: RDD[(Int, (Option[String], String))] = m1.rightOuterJoin(m2)
val join: RDD[(Int, (String, String))] = m1.join(m2)
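For readers without a Spark session at hand, the per-key semantics of fullOuterJoin can be sketched with plain Scala collections. This is a local illustration only, not Spark code, using the two inputs from the question:

```scala
// Local sketch of full-outer-join semantics: every key from either side
// appears once, with a missing value on one side encoded as None.
val m1 = Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
val m2 = Map(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E")

val fullOuter: Map[Int, (Option[String], Option[String])] =
  (m1.keySet ++ m2.keySet).map { k =>
    k -> (m1.get(k), m2.get(k))
  }.toMap

// fullOuter(1) == (Some("a"), Some("A"))
// fullOuter(4) == (Some("d"), None)
// fullOuter(5) == (None, Some("E"))
```

This is exactly the first output shape asked for in the question; Spark's `fullOuterJoin` computes the same result, but distributed by key across partitions.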
Answer 1 (score: 1)
A simple join should work:
rdd1.join(rdd2) // RDD[(K, (V1, V2))] - keeps only keys present in both RDDs
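The inner-join semantics can likewise be sketched locally with plain Scala Maps (an illustration of the behavior, not Spark code):

```scala
// Local sketch of inner-join semantics: only keys present in BOTH sides
// survive, each paired with its value from each side.
val m1 = Map(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
val m2 = Map(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E")

val inner: Map[Int, (String, String)] =
  m1.flatMap { case (k, v1) => m2.get(k).map(v2 => k -> (v1, v2)) }

// inner == Map(1 -> ("a","A"), 2 -> ("b","B"), 3 -> ("c","C"))
// Keys 4 and 5 are dropped because they exist on only one side.
```

This matches the second, smaller output the question asked for: keys 4 and 5 are silently dropped, which is why `fullOuterJoin` is needed for the `Option`-based result.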
Answer 2 (score: 1)
The accepted answer is correct. If you want to understand it better, take a look at this article on key/value pairs: Key/Value pairs
Answer 3 (score: 0)
Alternatively, use union:
rdd1.union(rdd2).groupByKey() // RDD[(K, Iterable[V])] - values from both sides are merged into one collection per key
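Note that this approach yields a different shape than the joins above: each key maps to a single `Iterable` of values, with no record of which input each value came from. A local sketch of the behavior (plain Scala, not Spark code):

```scala
// Local sketch of union + groupByKey: concatenate both pair lists,
// then group values under their key. Side-of-origin is lost.
val pairs =
  List(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d") ++
  List(1 -> "A", 2 -> "B", 3 -> "C", 5 -> "E")

val grouped: Map[Int, List[String]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

// grouped(1) == List("a", "A") - both values, but no (Option, Option) structure
// grouped(4) == List("d")
```

If you need to know which side a value came from, `fullOuterJoin` (or Spark's `cogroup`) preserves that distinction, whereas `union` + `groupByKey` does not.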