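Experts, I am trying to perform a kind of scan operation on a PySpark DataFrame, where I set the end date on each record based on the next record within its key group. This is what my DataFrame looks like: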
+---+----+----+-------------------+-------------------+
|Key|col1|col2|     effective_date|           end_date|
+---+----+----+-------------------+-------------------+
|  X| ABC| DEF|2020-08-01 00:00:00|2999-12-31 00:00:00|
|  X|ABC1|DEF1|2020-08-03 00:00:00|2999-12-31 00:00:00|
|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
|  Y| PQR| STU|2020-08-07 00:00:00|2999-12-31 00:00:00|
|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
+---+----+----+-------------------+-------------------+
Expected output:
+---+----+----+-------------------+-------------------+
|Key|col1|col2|     effective_date|           end_date|
+---+----+----+-------------------+-------------------+
|  X| ABC| DEF|2020-08-01 00:00:00|2020-08-02 23:59:59|
|  X|ABC1|DEF1|2020-08-03 00:00:00|2020-08-04 23:59:59|
|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
|  Y| PQR| STU|2020-08-07 00:00:00|2020-08-08 23:59:59|
|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
+---+----+----+-------------------+-------------------+
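The field to group records by is "Key". Within each Key group I want to keep only one record with an end_date of "2999-12-31 00:00:00" and mark all other records as expired. With the records sorted by effective date, each expired record's end date is the next record's effective date minus 1. I tried the following: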
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy("Key").orderBy("effective_date")
>>> df1=df.withColumn("end_date",F.date_sub(F.lead("effective_date").over(w), 1))
The output from this does not match what I expect. I am using Python 2.7 and Spark 2.2.
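To make the mismatch concrete, here is a small plain-Python sketch (no Spark involved) of what the attempt above would compute for the first record of Key X. It rests on two assumptions about the Spark functions: that F.date_sub works on dates, so the timestamp is truncated to a date before whole days are subtracted, and that F.lead returns null for the last record of each partition, which would leave that record without an end_date. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Effective date of the next record in the same Key group (what lead returns).
next_effective = datetime(2020, 8, 3, 0, 0, 0)

# Assumed behaviour of date_sub(..., 1): truncate to a date, then go back one day.
attempted_end = next_effective.date() - timedelta(days=1)

# What the expected output needs: one second before the next effective timestamp.
expected_end = next_effective - timedelta(seconds=1)

print(attempted_end)  # 2020-08-02
print(expected_end)   # 2020-08-02 23:59:59
```

The two values fall on the same calendar day, but only the second carries the 23:59:59 time component shown in the expected output.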
Answer 0 (score: 1)
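Use lead, but subtract one second from it rather than one day, and keep the existing end_date wherever lead is null (the last record of each Key).
Try this: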
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order each Key's records by effective_date so lead() sees the next record.
w = Window.partitionBy("Key").orderBy("effective_date")

# lead is null on the last record of each Key, so that record keeps its end_date.
df.withColumn("lead", F.lead("effective_date").over(w))\
  .withColumn("end_date", F.when(F.col("lead").isNotNull(), F.expr("lead - interval 1 second"))\
                           .otherwise(F.col("end_date")))\
  .drop("lead")\
  .orderBy("effective_date").show()
#+---+----+----+-------------------+-------------------+
#|Key|col1|col2|     effective_date|           end_date|
#+---+----+----+-------------------+-------------------+
#|  X| ABC| DEF|2020-08-01 00:00:00|2020-08-02 23:59:59|
#|  X|ABC1|DEF1|2020-08-03 00:00:00|2020-08-04 23:59:59|
#|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
#|  Y| PQR| STU|2020-08-07 00:00:00|2020-08-08 23:59:59|
#|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
#+---+----+----+-------------------+-------------------+
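For anyone who wants to check the rule without a Spark session, the same logic can be sketched in plain Python. This is not the Spark code, only an illustration of "next effective_date minus one second, with the last record of each Key left open-ended". The function name close_out, the OPEN_END constant, and the (key, effective_date) tuple layout are all hypothetical, chosen just for this example.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

OPEN_END = datetime(2999, 12, 31)  # sentinel end_date for the current record


def close_out(records):
    """Return (key, effective_date, end_date) tuples, one per input record.

    records: iterable of (key, effective_date) pairs (hypothetical layout).
    """
    result = []
    ordered = sorted(records, key=itemgetter(0, 1))  # by key, then effective_date
    for key, group in groupby(ordered, key=itemgetter(0)):
        dates = [eff for _, eff in group]
        # Pair each effective_date with the next one in the group (the "lead");
        # the last record has no successor, so it keeps the open-ended sentinel.
        for eff, nxt in zip(dates, dates[1:] + [None]):
            end = OPEN_END if nxt is None else nxt - timedelta(seconds=1)
            result.append((key, eff, end))
    return result


rows = [
    ("X", datetime(2020, 8, 1)),
    ("X", datetime(2020, 8, 3)),
    ("X", datetime(2020, 8, 5)),
    ("Y", datetime(2020, 8, 7)),
    ("Y", datetime(2020, 8, 9)),
]
closed = close_out(rows)
for key, eff, end in closed:
    print(key, eff, end)
```

Sorting by key and then by effective_date plays the role of partitionBy plus orderBy in the window, and pairing each date with its successor plays the role of lead.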