I have a Spark DataFrame like this in Python:
from datetime import datetime
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = sc.parallelize([
    ("id1", datetime(2017, 1, 1, 0, 0), 1, 2, "e1"),
    ("id2", datetime(2017, 2, 2, 0, 1), 1, 1, "e2"),
    ("id3", datetime(2017, 1, 3, 0, 0), 3, 2, "e1"),
    ("id1", datetime(2017, 1, 1, 0, 1), 2, 2, "e2"),
    ("id2", datetime(2017, 2, 2, 0, 0), 2, 2, "e1"),
    ("id1", datetime(2017, 1, 1, 0, 2), 0, 4, "e3"),
]).toDF(["pcb", "date", "start", "end", "el_list"]) \
  .withColumn("date", col("date").cast(TimestampType()))
df.show(truncate=False)
+---+---------------------+-----+---+-------+
|pcb|date |start|end|el_list|
+---+---------------------+-----+---+-------+
|id1|2017-01-01 00:00:00.0|1 |2 |e1 |
|id2|2017-02-02 00:01:00.0|1 |1 |e2 |
|id3|2017-01-03 00:00:00.0|3 |2 |e1 |
|id1|2017-01-01 00:01:00.0|2 |2 |e2 |
|id2|2017-02-02 00:00:00.0|2 |2 |e1 |
|id1|2017-01-01 00:02:00.0|0 |4 |e3 |
+---+---------------------+-----+---+-------+
我想按" pcb"分组然后采用以下值:
应该是:
+---+---------------------+-----+---+--------+
|pcb|date |start|end|el_list |
+---+---------------------+-----+---+--------+
|id1|2017-01-01 00:00:00.0|1 |4 |e1,e2,e3|
|id2|2017-02-02 00:01:00.0|2 |1 |e1,e2 |
|id3|2017-01-03 00:00:00.0|3 |2 |e1 |
+---+---------------------+-----+---+--------+
What function can I use for this?
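A minimal sketch of one possible approach, under these assumptions about the intent: start comes from the earliest row in each group, end from the latest row, and el_list is the comma-joined values in date order. For date I take the group minimum, which matches id1 and id3; the expected id2 row shows the later timestamp, so adjust that aggregate if something else is wanted. Ordered min/max and a date-sorted concatenation can all be expressed with structs, since Spark compares structs field by field. The name result is just illustrative:

from pyspark.sql import functions as F

result = (
    df.groupBy("pcb")
      .agg(
          # earliest (date, start) pair: min() on a struct picks the
          # element with the smallest first field, i.e. the earliest date
          F.min(F.struct("date", "start")).alias("first_row"),
          # latest (date, end) pair
          F.max(F.struct("date", "end")).alias("last_row"),
          # collect (date, el_list) pairs and sort them by date
          F.sort_array(F.collect_list(F.struct("date", "el_list"))).alias("events"),
      )
      .select(
          "pcb",
          F.col("first_row.date").alias("date"),
          F.col("first_row.start").alias("start"),
          F.col("last_row.end").alias("end"),
          # extract el_list from each struct and join with commas
          F.concat_ws(",", F.col("events.el_list")).alias("el_list"),
      )
)
result.show(truncate=False)

Because sort_array orders the collected structs by their first field (date), the concatenated el_list follows event time, giving "e1,e2,e3" for id1 and "e1,e2" for id2.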