After reading this post, I would like to know how to group a Dataset, but with multiple columns.
For example:
val test = Seq(("New York", "Jack", "jdhj"),
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
I would like to get
Chicago, [("David", "ff"), ("Andrew", "ddd")]
Answer 0 (score: 1)
Create a case class as follows:
case class TestData (location: String, name: String, value: String)
Dummy data:
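import spark.implicits._ // needed for .toDS() and $"..."; assumes a SparkSession named spark (as in spark-shell)
val test = Seq(("New York", "Jack", "jdhj"),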
("Los Angeles", "Tom", "ff"),
("Chicago", "David", "ff"),
("Houston", "John", "dd"),
("Detroit", "Michael", "fff"),
("Chicago", "Andrew", "ddd"),
("Detroit", "Peter", "dd"),
("Detroit", "George", "dkdjkd")
)
//change each row to TestData object
.map(x => TestData(x._1, x._2, x._3))
.toDS() // create dataset from above data
To get the required output:
import org.apache.spark.sql.functions.{collect_list, struct}
test.groupBy($"location")
.agg(collect_list(struct("name", "value")).as("data"))
.show(false)
Output:
+-----------+--------------------------------------------+
|location |data |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago |[[David,ff], [Andrew,ddd]] |
|Houston |[[John,dd]] |
|New York |[[Jack,jdhj]] |
+-----------+--------------------------------------------+
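Answer 1 (score: 0)
I have already suggested the case class way in the link you provided in the question. Here is a different approach.
RDD way
You can simply do the following: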
val rdd = sc.parallelize(test) //creating rdd from test
val resultRdd = rdd.groupBy(x => x._1) //grouping by the first element
.mapValues(x => x.map(y => (y._2, y._3))) //collecting the second and third element in the grouped dataset
resultRdd.foreach(println)
which should give you
(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
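The groupBy and mapValues steps above mirror what plain Scala collections do, so the expected groups can be checked locally without a cluster. The following is only a minimal sketch under that assumption; the helper name groupByCity is hypothetical and not part of the original answer.

```scala
// Plain Scala collections only; no Spark required.
val test = Seq(
  ("New York", "Jack", "jdhj"),
  ("Los Angeles", "Tom", "ff"),
  ("Chicago", "David", "ff"),
  ("Houston", "John", "dd"),
  ("Detroit", "Michael", "fff"),
  ("Chicago", "Andrew", "ddd"),
  ("Detroit", "Peter", "dd"),
  ("Detroit", "George", "dkdjkd")
)

// Hypothetical helper: group rows by the first element (the city)
// and keep only the (second, third) elements of each row.
def groupByCity(rows: Seq[(String, String, String)]): Map[String, Seq[(String, String)]] =
  rows.groupBy(_._1).map { case (city, group) =>
    city -> group.map(r => (r._2, r._3))
  }

val grouped = groupByCity(test)
println(grouped("Chicago")) // List((David,ff), (Andrew,ddd))
```

Within each group the rows keep their original order, which matches the RDD output shown above for this small input.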
Converting the RDD to a DataFrame
If you need the output in tabular format, you can call .toDF() after some manipulation:
val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()
df.show(false)
which should give you
+-----------+--------------------------------------------+
|_1 |_2 |
+-----------+--------------------------------------------+
|New York |[[Jack,jdhj]] |
|Houston |[[John,dd]] |
|Chicago |[[David,ff], [Andrew,ddd]] |
|Detroit |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]] |
+-----------+--------------------------------------------+
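The default column names _1 and _2 are not very descriptive. As a sketch, toDF also accepts column names, so the same conversion could label the columns directly. The names "location" and "data", the app name, and the local master setting are assumptions chosen for illustration. This is written as script-style statements, for example for a Scala script or spark-shell, with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession; in spark-shell one already exists as `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-example") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val test = Seq(
  ("New York", "Jack", "jdhj"),
  ("Los Angeles", "Tom", "ff"),
  ("Chicago", "David", "ff"),
  ("Houston", "John", "dd"),
  ("Detroit", "Michael", "fff"),
  ("Chicago", "Andrew", "ddd"),
  ("Detroit", "Peter", "dd"),
  ("Detroit", "George", "dkdjkd")
)

// Same grouping as the RDD way above.
val resultRdd = spark.sparkContext.parallelize(test)
  .groupBy(_._1)
  .mapValues(_.map(y => (y._2, y._3)))

// Pass explicit column names instead of relying on _1 and _2.
val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF("location", "data")
df.show(false)
```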