Group list data by key

Time: 2017-04-17 12:13:32

Tags: scala apache-spark

In Scala Spark I have an RDD that, collected as an array, looks like this:

Array[(String, Int)] = Array((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))

I need to group it by the first part of the key (A1, A2 or A3), so that each key maps to the list of its numbers, like this:

List( A1:(1,10), A2:(5,5), A3:(3) )

Please help me.

2 answers:

Answer 0 (score: 0)

Treating it as an RDD, we can proceed as follows.

scala> val x = List(("A1:B",1),("A1:A",10),("A2:C",5),("A2:E",5),("A3:D",3))
x: List[(String, Int)] = List((A1:B,1), (A1:A,10), (A2:C,5), (A2:E,5), (A3:D,3))
scala> x.map(a => (a._1.split(":"), a._2))
res1: List[(Array[String], Int)] = List((Array(A1, B),1), (Array(A1, A),10), (Array(A2, C),5), (Array(A2, E),5), (Array(A3, D),3))

scala> res1.map(a => (a._1(0),a._2))
res12: List[(String, Int)] = List((A1,1), (A1,10), (A2,5), (A2,5), (A3,3))

scala> val rdd = sc.makeRDD(res12)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at makeRDD at <console>:33

scala> rdd.groupByKey()
res13: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[16] at groupByKey at <console>:36
scala> res13.collect

res14: Array[(String, Iterable[Int])] = Array((A3,CompactBuffer(3)), (A1,CompactBuffer(1, 10)), (A2,CompactBuffer(5, 5)))
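
The intermediate steps can also be fused into a single pipeline before grouping. A minimal sketch of the same approach, assuming a SparkContext named `sc` is available as in the session above:

val rdd = sc.makeRDD(Seq(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3)))
val grouped = rdd
  .map { case (k, v) => (k.split(":")(0), v) } // keep only the key part before ':'
  .groupByKey()                                // gather all values per key

grouped.collect()
// Array((A3,CompactBuffer(3)), (A1,CompactBuffer(1, 10)), (A2,CompactBuffer(5, 5)))

Note that groupByKey shuffles every value across the cluster; if the values only need to be reduced (summed, etc.), reduceByKey is usually the cheaper choice.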

Answer 1 (score: 0)

You can try this:

val data = Array(("A1:B", 1), ("A1:A", 10), ("A2:C", 5), ("A2:E", 5), ("A3:D", 3))
val grpData = data.groupBy(f => f._1.split(":")(0)).map(x => (x._1 + ":(" + x._2.map(_._2).mkString(",") + ")")).toList
println(grpData)
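
Note that this answer works on a plain local Array, not on an RDD. A rough sketch of the distributed equivalent, reusing `data` from above and assuming a SparkContext named `sc`:

val rdd = sc.makeRDD(data)
val grpRdd = rdd
  .map { case (k, v) => (k.split(":")(0), v) }         // key by the prefix
  .groupByKey()                                        // gather values per key
  .map { case (k, vs) => s"$k:(${vs.mkString(",")})" } // format as K:(v1,v2)

grpRdd.collect().foreach(println) // prints lines such as A1:(1,10)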