I want to create a table where every row is a unique ID and the Place and City columns contain all the places and cities a person has visited, ordered by visit date. PySpark or Hive are both fine.
df.groupby("ID").agg(F.concat_ws("|", F.collect_list("Place")))
does the concatenation, but I cannot order it by date. I also have to repeat this step separately for each column.
I tried the window-function approach mentioned in this post (collect_list by preserving order based on another variable) as well, but it throws an error: java.lang.UnsupportedOperationException: 'collect_list(') is not supported in a window operation. I want to:
1- Sort the concatenated columns by travel date
2- Do this step for multiple columns
Data
| ID | Date | Place | City |
|----|------|-------|------|
| 1  | 2017 | UK    | Birm |
| 2  | 2014 | US    | LA   |
| 1  | 2018 | SIN   | Sin  |
| 1  | 2019 | MAL   | KL   |
| 2  | 2015 | US    | SF   |
| 3  | 2019 | UK    | Lon  |
Expected output
| ID | Place      | City        |
|----|------------|-------------|
| 1  | UK,SIN,MAL | Birm,Sin,KL |
| 2  | US,US      | LA,SF       |
| 3  | UK         | Lon         |
Answer 0 (score: 2)
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('ID').orderBy('Date')
# Input DataFrame
>>> df.show()
+---+----+-----+----+
| ID|Date|Place|City|
+---+----+-----+----+
|  1|2017|   UK|Birm|
|  2|2014|   US|  LA|
|  1|2018|  SIN| Sin|
|  1|2019|  MAL|  KL|
|  2|2015|   US|  SF|
|  3|2019|   UK| Lon|
+---+----+-----+----+
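# collect_list over w returns, for each row, the Date-ordered prefix of the
# values seen so far within its ID; every shorter prefix compares less than
# the full array, so max() keeps the complete, ordered list per ID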
>>> df2 = df.withColumn("Place", F.collect_list("Place").over(w)) \
...         .withColumn("City", F.collect_list("City").over(w)) \
...         .groupBy("ID") \
...         .agg(F.max("Place").alias("Place"), F.max("City").alias("City"))
# Collected values as arrays
>>> df2.show()
+---+--------------+---------------+
| ID|         Place|           City|
+---+--------------+---------------+
|  3|          [UK]|          [Lon]|
|  1|[UK, SIN, MAL]|[Birm, Sin, KL]|
|  2|      [US, US]|       [LA, SF]|
+---+--------------+---------------+
# If you want the values as comma-separated strings, as in the expected output
>>> df2.withColumn("Place", F.concat_ws(",", "Place")).withColumn("City", F.concat_ws(",", "City")).show()
+---+----------+-----------+
| ID|     Place|       City|
+---+----------+-----------+
|  3|        UK|        Lon|
|  1|UK,SIN,MAL|Birm,Sin,KL|
|  2|     US,US|      LA,SF|
+---+----------+-----------+
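If you are on Spark 2.4 or later, the window and the max() trick aren't strictly needed. As a minimal alternative sketch (assuming Spark 2.4+ for the SQL higher-order function transform; ordered_concat is a helper name introduced here, not from the answer above), you can collect (Date, value) structs in a plain groupBy, sort each array, and extract the values:
>>> def ordered_concat(col):
...     # sort_array orders (Date, value) structs by their first field (Date);
...     # transform keeps only the value; concat_ws joins the values with ","
...     return F.concat_ws(
...         ",",
...         F.expr(f"transform(sort_array(collect_list(struct(Date, {col}))), x -> x.{col})")
...     ).alias(col)
...
>>> df.groupBy("ID").agg(*[ordered_concat(c) for c in ["Place", "City"]]).show()
This does one expression per column in a single aggregation, so it scales to any number of columns without repeating the window step.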