Python Spark DataFrame groupBy based on another column

Asked: 2018-02-13 14:42:16

Tags: python apache-spark dataframe pyspark spark-dataframe

I have a Spark dataframe in Python that looks like this:

from datetime import datetime

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = sc.parallelize([
    ("id1", datetime(2017, 1, 1, 0, 0), 1, 2, "e1"),
    ("id2", datetime(2017, 2, 2, 0, 1), 1, 1, "e2"),
    ("id3", datetime(2017, 1, 3, 0, 0), 3, 2, "e1"),
    ("id1", datetime(2017, 1, 1, 0, 1), 2, 2, "e2"),
    ("id2", datetime(2017, 2, 2, 0, 0), 2, 2, "e1"),
    ("id1", datetime(2017, 1, 1, 0, 2), 0, 4, "e3"),
]).toDF(["pcb", "date", "start", "end", "el_list"]) \
  .withColumn("date", col("date").cast(TimestampType()))

df.show(truncate=False)


+---+---------------------+-----+---+-------+
|pcb|date                 |start|end|el_list|
+---+---------------------+-----+---+-------+
|id1|2017-01-01 00:00:00.0|1    |2  |e1     |
|id2|2017-02-02 00:01:00.0|1    |1  |e2     |
|id3|2017-01-03 00:00:00.0|3    |2  |e1     |
|id1|2017-01-01 00:01:00.0|2    |2  |e2     |
|id2|2017-02-02 00:00:00.0|2    |2  |e1     |
|id1|2017-01-01 00:02:00.0|0    |4  |e3     |
+---+---------------------+-----+---+-------+

我想按" pcb"分组然后采用以下值:

  • 最长日期(易于办事)
  • "开始"值与最小值相关联" date"为了那个" pcb"
  • " end"值与最大值" date"相关联为了那个" pcb"
  • " el_list"在一个唯一的集合列表中( - > pyspark.sql.functions.collect_set)

The result should be:

+---+---------------------+-----+---+--------+
|pcb|date                 |start|end|el_list |
+---+---------------------+-----+---+--------+
|id1|2017-01-01 00:00:00.0|1    |4  |e1,e2,e3|
|id2|2017-02-02 00:01:00.0|2    |1  |e1,e2   |
|id3|2017-01-03 00:00:00.0|3    |2  |e1      |
+---+---------------------+-----+---+--------+

Which functions can I use to do this?
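
What I have in mind is something along these lines, using struct ordering to emulate a min-by/max-by on "date" (just a rough, untested sketch):

from pyspark.sql import functions as F

result = df.groupBy("pcb").agg(
    # maximum "date" for the group
    F.max("date").alias("date"),
    # "start" taken from the row with the smallest "date"
    # (structs compare field by field, so ordering is driven by "date")
    F.min(F.struct("date", "start"))["start"].alias("start"),
    # "end" taken from the row with the largest "date"
    F.max(F.struct("date", "end"))["end"].alias("end"),
    # unique "el_list" values collected as a set
    F.collect_set("el_list").alias("el_list"),
)

result.show(truncate=False)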

0 Answers:

There are no answers yet.