I want to turn the following RDDs into a single RDD of the form (id, (all names of the same type)).
>val test = rddByKey.map{case(k,v)=> (k,v.collect())}
test: Array[(String, Array[String])] =
Array(
(45000,Array(Amit, Pavan, Ratan)),
(10000,Array(Kumar, Venkat, Sheela)),
(50000,Array(Tejas, Dinesh, Lokesh, Bhupesh))
)
I would like to print it like this:
(45000,(Amit, Pavan, Ratan))
(10000,(Kumar, Venkat, Sheela))
This is what I have tried:
val data = sc.textFile("/user/cloudera/data.csv")
val rdd = data.map(r=>(r.split(",")(0),r.split(",")(1)))
val groupByKey = rdd.groupByKey().collect()
val rddByKey = groupByKey.map{case(k,v) => k->sc.makeRDD(v.toSeq)}
val test = rddByKey.map{case(k,v)=> (k,v.collect())}
Answer 0 (score: 0)
You don't need to go through the complexity of using collect and building one RDD per key. You can simply do:
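val data = sc.textFile("/user/cloudera/data.csv")
// Split each line only once and keep (id, name)
val rdd = data.map(r=>{
  val x = r.split(",")
  (x(0),x(1))
})
// Group the names by id, then prepend the id to its list of names
val groupByKey = rdd.groupByKey().map{case (x, y) => (x :: y.toList)}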
The elements of groupByKey are then:
List(45000, Amit, Pavan, Ratan)
List(10000, Kumar, Venkat, Sheela)
List(50000, Tejas, Dinesh, Lokesh, Bhupesh)
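If you want the exact (id,(names)) layout from the question rather than a flat List, a minimal sketch could look like the following. It assumes a local SparkContext and a few in-memory sample rows in place of /user/cloudera/data.csv; the names GroupNamesById and formatEntry are hypothetical and only used for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupNamesById {
  // Hypothetical helper: renders one grouped entry as "(id,(name1, name2, ...))"
  def formatEntry(id: String, names: Iterable[String]): String = {
    val joined = names.mkString("(", ", ", ")")
    s"($id,$joined)"
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster
    val conf = new SparkConf().setAppName("group-names").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: sample rows standing in for the lines of data.csv
    val lines = sc.parallelize(Seq("45000,Amit", "10000,Kumar", "45000,Pavan", "10000,Venkat"))

    val pairs = lines.map { r =>
      val x = r.split(",")
      (x(0), x(1))
    }

    // One string per id; collect only the small, already formatted result
    pairs.groupByKey()
      .map { case (id, names) => formatEntry(id, names) }
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

Note that groupByKey does not guarantee the order of the ids or of the names within an id, so the printed lines may come out in a different order than in the question.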
I hope this answer is helpful.