PySpark - remove duplicates within a group and keep only the first row

Asked: 2020-10-08 20:07:56

Tags: python apache-spark pyspark

How can I compute max(value) per group within the same day (per id and date_only), write it to a max_value column on only the first row of each group, and leave max_value as Null on the remaining (duplicate) rows?

+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| Null|2016-10-01| 
| J6|2019-10-01 01:00:00|    1|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J6|2019-10-01 12:30:30|    3|2016-10-01|
| J2|2019-10-06 00:00:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
| J2|2019-10-06 09:20:00|    1|2016-10-06|
| J2|2019-10-06 09:20:00|    9|2016-10-06|
+---+-------------------+-----+----------+
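
For reference, a minimal sketch of how this sample DataFrame could be recreated (the SparkSession variable `spark` and keeping both date columns as plain strings are assumptions, not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the table above; value is None where the table shows Null
rows = [
    ("J6", "2019-10-01 00:00:00", None, "2016-10-01"),
    ("J6", "2019-10-01 01:00:00", 1, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J6", "2019-10-01 12:30:30", 3, "2016-10-01"),
    ("J2", "2019-10-06 00:00:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 1, "2016-10-06"),
    ("J2", "2019-10-06 09:20:00", 9, "2016-10-06"),
]
df = spark.createDataFrame(rows, ["id", "date", "value", "date_only"])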

Desired dataframe:

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| Null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     Null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     Null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     Null|
+---+-------------------+-----+----------+---------+

1 Answer:

Answer 0 (score: 1):

Use a combination of the max() and row_number() window functions:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window ordered by date: row_number() == 1 marks the earliest row per (id, date_only)
w = Window.partitionBy("id", "date_only").orderBy("date")
# Unordered window over the same group, used to compute the group-wide max(value)
w_max = Window.partitionBy("id", "date_only")

df.withColumn('max_value',
              F.when(F.row_number().over(w) == 1,
                     F.max('value').over(w_max))).show()

+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2016-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2016-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2016-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2016-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2016-10-06|     null|
+---+-------------------+-----+----------+---------+
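
An alternative sketch (not part of the original answer): compute the per-day maximum once with groupBy/agg and join it back, attaching it only to the earliest row of each group. The names max_per_day, grp_max and rn are made up for illustration:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id", "date_only").orderBy("date")

# Per-group maximum computed with an aggregation instead of a window max
max_per_day = df.groupBy("id", "date_only").agg(F.max("value").alias("grp_max"))

result = (df.withColumn("rn", F.row_number().over(w))
            .join(max_per_day, ["id", "date_only"], "left")
            .withColumn("max_value", F.when(F.col("rn") == 1, F.col("grp_max")))
            .drop("rn", "grp_max"))
result.show()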