How to perform a SCAN operation over key groups in a PySpark DataFrame

Time: 2020-08-21 17:03:26

Tags: apache-spark pyspark

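Experts, I am trying to perform a kind of scan operation on a PySpark DataFrame, in which I set the end date on each record based on the next record within its key group. This is what my DataFrame looks like: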

+---+----+----+-------------------+-------------------+
|Key|col1|col2|     effective_date|           end_date|
+---+----+----+-------------------+-------------------+
|  X| ABC| DEF|2020-08-01 00:00:00|2999-12-31 00:00:00|
|  X|ABC1|DEF1|2020-08-03 00:00:00|2999-12-31 00:00:00|
|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
|  Y| PQR| STU|2020-08-07 00:00:00|2999-12-31 00:00:00|
|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
+---+----+----+-------------------+-------------------+

Expected output:

+---+----+----+-------------------+-------------------+
|Key|col1|col2|     effective_date|           end_date|
+---+----+----+-------------------+-------------------+
|  X| ABC| DEF|2020-08-01 00:00:00|2020-08-02 23:59:59|
|  X|ABC1|DEF1|2020-08-03 00:00:00|2020-08-04 23:59:59|
|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
|  Y| PQR| STU|2020-08-07 00:00:00|2020-08-08 23:59:59|
|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
+---+----+----+-------------------+-------------------+

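The field to group the records by is "Key", and I want to keep only one record per key group with an end_date of "2999-12-31 00:00:00". All other records should be marked as expired: when the records are sorted by effective date, each end date is derived from the next record's effective date minus 1 (one second, going by the expected output above). This is what I tried: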

>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy("Key").orderBy("effective_date")
>>> df1=df.withColumn("end_date",F.date_sub(F.lead("effective_date").over(w), 1))

The output does not match what I expect. I am using Python 2.7 and Spark 2.2.

1 Answer:

Answer 0: (score: 1)

Try this, using lead:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

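# Within each Key, order records by effective_date so lead() returns the next record.
w=Window().partitionBy("Key").orderBy("effective_date")

# lead is null for the last record of each Key, which keeps its original end_date;
# every other record ends one second before the next record's effective_date.
df.withColumn("lead", F.lead("effective_date").over(w))\
  .withColumn("end_date", F.when(F.col("lead").isNotNull(), F.expr("""lead - interval 1 second"""))\
                           .otherwise(F.col("end_date"))).drop("lead")\
  .orderBy("effective_date").show()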

#+---+----+----+-------------------+-------------------+
#|Key|col1|col2|     effective_date|           end_date|
#+---+----+----+-------------------+-------------------+
#|  X| ABC| DEF|2020-08-01 00:00:00|2020-08-02 23:59:59|
#|  X|ABC1|DEF1|2020-08-03 00:00:00|2020-08-04 23:59:59|
#|  X|ABC2|DEF2|2020-08-05 00:00:00|2999-12-31 00:00:00|
#|  Y| PQR| STU|2020-08-07 00:00:00|2020-08-08 23:59:59|
#|  Y|PQR1|STU1|2020-08-09 00:00:00|2999-12-31 00:00:00|
#+---+----+----+-------------------+-------------------+
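
To sanity-check the rule without a Spark session, the same logic can be sketched in plain Python. This is only an illustration of what the window computes, not part of the Spark solution: the helper name close_out and the tuple layout (Key, effective_date, end_date) are assumptions made for this example, and it is written for Python 3.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def close_out(rows):
    """Hypothetical helper: rows are (key, effective_date, end_date) tuples.

    Mirrors partitionBy("Key").orderBy("effective_date") followed by lead():
    each record ends one second before the next record in its key group,
    and the last record of a group keeps its original end_date.
    """
    result = []
    # Sort by key, then effective_date, so groupby sees contiguous key groups.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Pair every record with its successor; the last one has none.
        for current, nxt in zip(group, group[1:] + [None]):
            key, effective, end = current
            if nxt is not None:
                end = nxt[1] - timedelta(seconds=1)
            result.append((key, effective, end))
    return result


OPEN_END = datetime(2999, 12, 31)
rows = [
    ("X", datetime(2020, 8, 1), OPEN_END),
    ("X", datetime(2020, 8, 3), OPEN_END),
    ("X", datetime(2020, 8, 5), OPEN_END),
    ("Y", datetime(2020, 8, 7), OPEN_END),
    ("Y", datetime(2020, 8, 9), OPEN_END),
]

closed = close_out(rows)
for row in closed:
    print(row)
```

Comparing the printed end dates with the Spark output is a quick way to confirm that the window spec and the one-second interval behave as intended on a small sample.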