How can I compute max(value) of df.value per group within the same day, and de-duplicate df.max_value within each group so that only the first (earliest) row keeps the value?
+---+-------------------+-----+----------+
| id|               date|value| date_only|
+---+-------------------+-----+----------+
| J6|2019-10-01 00:00:00| null|2019-10-01|
| J6|2019-10-01 01:00:00|    1|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J6|2019-10-01 12:30:30|    3|2019-10-01|
| J2|2019-10-06 00:00:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
| J2|2019-10-06 09:20:00|    1|2019-10-06|
| J2|2019-10-06 09:20:00|    9|2019-10-06|
+---+-------------------+-----+----------+
Desired dataframe:
+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
+---+-------------------+-----+----------+---------+
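For anyone who wants to reproduce this, here is a minimal sketch that builds the sample DataFrame above. The schema is an assumption (value is treated as a nullable integer, and date_only is derived from date with to_date):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [("J6", "2019-10-01 00:00:00", None),
        ("J6", "2019-10-01 01:00:00", 1),
        ("J6", "2019-10-01 12:30:30", 3),
        ("J6", "2019-10-01 12:30:30", 3),
        ("J2", "2019-10-06 00:00:00", 9),
        ("J2", "2019-10-06 09:20:00", 9),
        ("J2", "2019-10-06 09:20:00", 1),
        ("J2", "2019-10-06 09:20:00", 9)]

df = (spark.createDataFrame(data, "id string, date string, value int")
           .withColumn("date", F.to_timestamp("date"))    # string -> timestamp
           .withColumn("date_only", F.to_date("date")))   # calendar-day column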
Answer (score: 1)
Use a combination of max() and row_number():
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order rows within each (id, date_only) group, earliest timestamp first
w = Window.partitionBy("id", "date_only").orderBy("date")

# Put the per-group max(value) on the first row only; when() without
# otherwise() leaves every other row null
df.withColumn("max_value",
              F.when(F.row_number().over(w) == 1,
                     F.max("value").over(Window.partitionBy("id", "date_only")))
).show()
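Two details make this work: F.when() without an .otherwise() returns null for every non-matching row, which is exactly the blanking the question asks for, and max() ignores nulls, so the null value on the first J6 row does not affect the group maximum. One caveat: if the earliest timestamp in a group is duplicated, row_number() picks an arbitrary one of the tied rows as row 1 unless you add a tie-breaker column to the orderBy.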
+---+-------------------+-----+----------+---------+
| id|               date|value| date_only|max_value|
+---+-------------------+-----+----------+---------+
| J6|2019-10-01 00:00:00| null|2019-10-01|        3|
| J6|2019-10-01 01:00:00|    1|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J6|2019-10-01 12:30:30|    3|2019-10-01|     null|
| J2|2019-10-06 00:00:00|    9|2019-10-06|        9|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    1|2019-10-06|     null|
| J2|2019-10-06 09:20:00|    9|2019-10-06|     null|
+---+-------------------+-----+----------+---------+
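For comparison, the same result can also be produced without a window max, by aggregating the per-group maxima separately and joining them back. This is not from the original answer, just a sketch of an equivalent approach; the row_number() blanking step is unchanged:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-group maxima computed as a plain aggregate
maxes = df.groupBy("id", "date_only").agg(F.max("value").alias("max_value"))

# Join the maxima back, then blank out all but the first row of each group
w = Window.partitionBy("id", "date_only").orderBy("date")
result = (df.join(maxes, ["id", "date_only"], "left")
            .withColumn("max_value",
                        F.when(F.row_number().over(w) == 1, F.col("max_value"))))
result.show()

The window version avoids the extra shuffle of the join, so it is usually the better choice; the aggregate-and-join form is mainly useful when you already need the grouped maxima elsewhere.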