Using a PySpark DataFrame, I am trying to build a running sequence for each categorical variable. For every id, the rows need to be ordered by date, and the values of each categorical column accumulated as shown in the output below: each row's sequence should contain the space-separated values seen for that id from its earliest record up to and including the current record. Note that no records should be dropped.
INPUT:
+----+------------------------------+------+-------+------+
| id | date | cat1 | cat2 | cat3 |
+----+------------------------------+------+-------+------+
| 1 | 2018-01-25 00:00:... | C | Text1 | val1 |
| 1 | 2018-01-25 00:00:... | A | Text1 | val3 |
| 1 | 2018-01-25 00:00:... | B | Text5 | val5 |
| 2 | 2018-01-26 00:00:... | A | Text2 | val1 |
| 2 | 2018-01-26 00:00:... | A | Text1 | val2 |
| 3 | 2018-01-27 00:00:... | C | Text6 | val1 |
| 3 | 2018-01-29 00:00:... | A | Text2 | val9 |
| 3 | 2018-01-29 00:00:... | C | Text6 | val5 |
| 3 | 2018-02-05 00:00:... | A | Text1 | val3 |
+----+------------------------------+------+-------+------+
OUTPUT:
+----+------------------------------+----------+-------------------------+---------------------+
| id | date | cat1_seq | cat2_seq | cat3_seq |
+----+------------------------------+----------+-------------------------+---------------------+
| 1 | 2018-01-25 00:00:... | C | Text1 | val1 |
| 1 | 2018-01-25 00:00:... | C A | Text1 Text1 | val1 val3 |
| 1 | 2018-01-25 00:00:... | C A B | Text1 Text1 Text5 | val1 val3 val5 |
| 2 | 2018-01-26 00:00:... | A | Text2 | val1 |
| 2 | 2018-01-26 00:00:... | A A | Text2 Text1 | val1 val2 |
| 3 | 2018-01-27 00:00:... | C | Text6 | val1 |
| 3 | 2018-01-29 00:00:... | C A | Text6 Text2 | val1 val9 |
| 3 | 2018-01-29 00:00:... | C A C | Text6 Text2 Text6 | val1 val9 val5 |
| 3 | 2018-02-05 00:00:... | C A C A | Text6 Text2 Text6 Text1 | val1 val9 val5 val3 |
+----+------------------------------+----------+-------------------------+---------------------+
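For reference, the sample input above can be recreated roughly as follows. This is only a sketch: the SparkSession handle spark is assumed, and the timestamps are assumed to be midnight because they are truncated in the tables above.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
# Timestamps assumed to be midnight; the originals are truncated in the tables above
df = spark.createDataFrame(
    [(1, "2018-01-25 00:00:00", "C", "Text1", "val1"),
     (1, "2018-01-25 00:00:00", "A", "Text1", "val3"),
     (1, "2018-01-25 00:00:00", "B", "Text5", "val5"),
     (2, "2018-01-26 00:00:00", "A", "Text2", "val1"),
     (2, "2018-01-26 00:00:00", "A", "Text1", "val2"),
     (3, "2018-01-27 00:00:00", "C", "Text6", "val1"),
     (3, "2018-01-29 00:00:00", "A", "Text2", "val9"),
     (3, "2018-01-29 00:00:00", "C", "Text6", "val5"),
     (3, "2018-02-05 00:00:00", "A", "Text1", "val3")],
    ["id", "date", "cat1", "cat2", "cat3"],
).withColumn("date", f.col("date").cast("timestamp"))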
SOLUTION: I got it done using the following code suggested by @Pault -
import pyspark.sql.functions as f
from pyspark.sql import Window

# For each column, collect the running list of values per id (ordered by date) and join them with spaces
w = Window.partitionBy("id").orderBy("date")
df1 = df.select("*", *[f.concat_ws(" ", f.collect_list(c).over(w)).alias(c + "_seq") for c in df.columns])
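One caveat: when only orderBy is specified, Spark uses a RANGE-based window frame by default, so rows that share the same date within an id (e.g. the three rows for id 1) all receive the same fully accumulated sequence rather than the row-by-row growth shown in the output. If per-row growth is required, a ROWS frame can be forced. A minimal sketch, reusing the df, window imports, and column names from above (note that the ordering of rows within a tied date is not deterministic without an extra tie-breaking column):

# Force a ROWS frame so the sequence grows one row at a time even on tied dates
w_rows = (Window.partitionBy("id").orderBy("date")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df2 = df.select("*", *[f.concat_ws(" ", f.collect_list(c).over(w_rows)).alias(c + "_seq")
                       for c in ["cat1", "cat2", "cat3"]])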