Apache Spark: concatenate multiple rows into a list in a single row

Time: 2017-09-29 04:57:28

Tags: scala apache-spark hive apache-spark-sql

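I need to create a table (Hive table / Spark DataFrame) from a source table, collecting each user's data, which is spread across multiple rows, into a list in a single row.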

User table:
Schema: userid: string | transactiondate: string | charges: string | events: array<struct<name:string,value:string>>
----|------------|-------|----------------------------------------
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
123 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
123 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"this"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"last"}]
456 | 2017-09-01 | 20.00 | [{"name":"chargeperiod","value":"recent"}]
456 | 2017-09-01 | 30.00 | [{"name":"chargeperiod","value":"0"}]

The output table should be:

userid: String | concatenatedlist: List[Row]
-------|-----------------
123    | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]
456    | [[2017-09-01,20.00,[{"name":"chargeperiod","value":"this"}]],[2017-09-01,30.00,[{"name":"chargeperiod","value":"last"}]],[2017-09-01,20.00,[{"name":"chargeperiod","value":"recent"}]], [2017-09-01,30.00, [{"name":"chargeperiod","value":"0"}]]]

Spark version: 1.6.2

1 answer:

Answer 0 (score: 3)

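// Spark 1.6 spark-shell: collect_list is an alias for the Hive UDAF, so sqlContext is assumed to be a HiveContext.
import org.apache.spark.sql.functions.{collect_list, concat_ws}
import sqlContext.implicits._
val rdd = sc.parallelize(Seq(("1","2017-02-01","20.00","abc"),("1","2017-02-01","30.00","abc2"),("2","2017-02-01","20.00","abc"),("2","2017-02-01","30.00","abc")))
val df = rdd.toDF("id","date","amt","array")
// Join each row's fields into one string, then join each id's strings into a single row.
df.withColumn("new", concat_ws(",", $"date", $"amt", $"array")).select("id", "new").groupBy("id").agg(concat_ws(",", collect_list("new")))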
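
Note that the snippet above yields one comma-joined string per id, not the List[Row] shape shown in the question. If the nested structure has to survive, one possible route is to group on the RDD side and convert back to a DataFrame. The following is only a sketch under assumptions: it reuses the answer's sample data (with the events column simplified to a plain string), and the names GroupRowsSketch, sample, and groupRows are made up for illustration. It also runs as a standalone local program rather than in spark-shell, so it creates its own SparkContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object GroupRowsSketch {
  // Same sample rows as the answer: (id, date, amt, array), all strings.
  val sample: Seq[(String, String, String, String)] = Seq(
    ("1", "2017-02-01", "20.00", "abc"),
    ("1", "2017-02-01", "30.00", "abc2"),
    ("2", "2017-02-01", "20.00", "abc"),
    ("2", "2017-02-01", "30.00", "abc")
  )

  // Hypothetical helper: key each row by id and gather the remaining
  // fields of all rows with that id into one sequence of tuples.
  def groupRows(
      rows: RDD[(String, String, String, String)]
  ): RDD[(String, Seq[(String, String, String)])] =
    rows
      .map { case (id, date, amt, events) => (id, (date, amt, events)) }
      .groupByKey() // element order inside each group is not guaranteed
      .mapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("group-rows-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    try {
      // The Seq of tuples becomes an array<struct<...>> column.
      val grouped = groupRows(sc.parallelize(sample)).toDF("id", "concatenatedlist")
      grouped.printSchema()
      grouped.show(false)
    } finally {
      sc.stop()
    }
  }
}
```

Two caveats: groupByKey moves every value for a key to a single executor, so very large groups can be costly, and if the order of entries within a user's list matters, the values would need an explicit sort (for example by date) after grouping.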