How to split a Spark RDD of Array[(String, Array[String])] into individual RDDs

Asked: 2018-03-13 01:36:27

Tags: scala apache-spark rdd

I want to split the following RDD into individual RDDs of the form (id, (all names of the same type)):

val test = rddByKey.map{ case (k, v) => (k, v.collect()) }

test: Array[(String, Array[String])] =   
  Array(
    (45000,Array(Amit, Pavan, Ratan)),
    (10000,Array(Kumar, Venkat, Sheela)), 
    (50000,Array(Tejas, Dinesh, Lokesh, Bhupesh))
  )

I want to print it like this:

(45000,(Amit, Pavan, Ratan))
(10000,(Kumar, Venkat, Sheela))

This is what I tried:

val data = sc.textFile("/user/cloudera/data.csv") 
val rdd = data.map(r=>(r.split(",")(0),r.split(",")(1))) 
val groupByKey = rdd.groupByKey().collect() 
val rddByKey = groupByKey.map{case(k,v) => k->sc.makeRDD(v.toSeq)} 
val test = rddByKey.map{case(k,v)=> (k,v.collect())}
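
As an aside, if the goal really is one separate RDD per key, a minimal sketch (not from the original post, and assuming the number of distinct keys is small enough to collect to the driver) would filter the pair RDD once per key instead of collecting the grouped data and calling makeRDD on the driver:

// Sketch only: build one RDD per key by filtering the pair RDD from above.
// Assumes rdd: RDD[(String, String)] and a small set of distinct keys.
val keys = rdd.keys.distinct().collect()
val rddPerKey = keys.map(k => k -> rdd.filter(_._1 == k).values).toMap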

1 Answer:

Answer 0 (score: 0):

You don't have to go through the complexity of using collect. You can simply do:

val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r=>{
  val x = r.split(",")
  (x(0),x(1))
})
val groupByKey = rdd.groupByKey().map{case (x, y) => (x :: y.toList)}

groupByKey.collect().foreach(println)

List(45000, Amit, Pavan, Ratan)
List(10000, Kumar, Venkat, Sheela)
List(50000, Tejas, Dinesh, Lokesh, Bhupesh)
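
If you want to keep the (key, names) tuple shape shown in the question rather than a flat List, a small variation on the same idea (a sketch using mapValues on the same rdd) is:

// Sketch: keep each key paired with its grouped values
val byKey = rdd.groupByKey().mapValues(_.toList)
byKey.collect().foreach(println)
// e.g. (45000,List(Amit, Pavan, Ratan)) -- the order of values within a key is not guaranteed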

I hope this answer is helpful.