Transformation of Array[RDD[(String, Set[String])]] in Spark Scala

Time: 2016-03-14 05:58:51

Tags: scala apache-spark apache-spark-sql

I have an array of RDDs of type Array[RDD[(String, Set[String])]], where each RDD holds tuples of a key and a value. The key is a String and the value is a Set[String], and I want to merge/union the Sets that share the same key. I have tried to do this in Scala but with no luck. Can you help me?

e.g.
RDD["A",Set("1","2")]
RDD["A",Set("3","4")]
RDD["B",Set("1","2")]
RDD["B",Set("3","4")]
RDD["C",Set("1","2")]
RDD["C",Set("3","4")]

After transformation:
RDD["A",Set("1","2","3","4")]
RDD["B",Set("1","2","3","4")]
RDD["C",Set("1","2","3","4")]

1 Answer:

Answer 0 (score: 2)

If a single output RDD is fine (there is no real reason to keep many RDDs with only one record each), you can reduce the Array of RDDs into a single RDD and then do a groupByKey:

arr.reduce( _ ++ _ )
   .groupByKey
   .mapValues(_.flatMap(identity))

Example:

scala> val x = sc.parallelize( List( ("A", Set(1,2)) ) )
scala> val x2 = sc.parallelize( List( ("A", Set(3,4)) ) )
scala> val arr = Array(x,x2)
arr: Array[org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Set[Int])]] = Array(ParallelCollectionRDD[0] at parallelize at <console>:27, ParallelCollectionRDD[1] at parallelize at <console>:27)
scala> arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity)).foreach(println)
(A,List(1, 2, 3, 4))
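Not part of the original answer, but if you want to keep the merged values typed as a Set rather than a flattened Iterable, a minimal sketch using reduceByKey with set union instead of groupByKey (assuming the same arr as above):

// reduceByKey merges the sets per key as it goes, so whole value
// groups never need to be shuffled the way groupByKey requires.
val merged = arr.reduce(_ ++ _).reduceByKey(_ ++ _)
merged.foreach(println)
// e.g. (A,Set(1, 2, 3, 4))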

@Edit: I think this is a really bad idea and would suggest you reconsider it, but you can get the result you want by taking all the keys from the above and filtering the RDD multiple times:

val sub = arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity))
val keys = sub.map(_._1).collect()
val result = for(k <- keys) yield sub.filter(_._1 == k)
result: Array[org.apache.spark.rdd.RDD[(String, Iterable[Int])]]

Each of these RDDs will contain a single tuple; I don't think it's very useful, but it works fine.
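For completeness, a small usage sketch for inspecting the per-key RDDs produced above (only reasonable for small data, since each RDD is collected to the driver):

result.foreach(rdd => rdd.collect().foreach(println))
// prints one (key, values) tuple per RDD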