Using the lead window function with an MD5-computed value

Time: 2018-07-26 17:08:53

Tags: pyspark apache-spark-sql

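I am using the lead window function, and I am trying to apply it to a value computed from multiple columns with the md5 function.

Basically, the value of strt_dts should end up in end_dts.

Code: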

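from pyspark.sql import Window
from pyspark.sql.functions import concat_ws, lead, md5

# Hash the comma-joined instnc_nm and strm_nb values of each row.
md5DF = df.withColumn("md5Val", md5(concat_ws(",", "instnc_nm", "strm_nb")))
md5DF.show(20, False)
# lead() is applied to md5Val here, so end_dts receives the next row's hash.
md5DF.withColumn('end_dts',
                  lead(md5DF['md5Val'], default='9999-12-31 00:00:00.000000')
                  .over(Window.partitionBy("Id").orderBy("strt_dts")))\
      .show(20, False)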

Output:

+--------+-------------------+---------+------------+--------------------------------+
|Id      |strt_dts           |instnc_nm|strm_nb     |md5Val                          |
+--------+-------------------+---------+------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B         |bc3cdb48b849c565c241e2c0f8b7d156|
|27608174|2018-08-17 04:00:00|Standard |11C         |da13b49654960a0de488dae1b4e7e7d3|
|27608173|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|
|27608171|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|
|27608174|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|
+--------+-------------------+---------+------------+--------------------------------+

+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|Id      |strt_dts           |instnc_nm|strm_nb     |md5Val                          |end_dts                         |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B         |bc3cdb48b849c565c241e2c0f8b7d156|de580c5dc7ffa27324a2a8845d1347dc|
|27608171|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
|27608174|2018-08-17 04:00:00|Standard |11C         |da13b49654960a0de488dae1b4e7e7d3|de580c5dc7ffa27324a2a8845d1347dc|
|27608174|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
|27608173|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+

Expected:

Wherever end_dts currently contains an md5 hash code, it should show the strt_dts value instead.

+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|Id      |strt_dts           |instnc_nm|strm_nb     |md5Val                          |end_dts                         |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B         |bc3cdb48b849c565c241e2c0f8b7d156|2018-09-17 04:00:00             |
|27608171|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
|27608174|2018-08-17 04:00:00|Standard |11C         |da13b49654960a0de488dae1b4e7e7d3|2018-09-17 04:00:00             |
|27608174|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
|27608173|2018-09-17 04:00:00|Standard |11D         |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000      |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+

Update

I got the result I wanted with the code below, but I feel this is the worst way to do it. Is there a more efficient way?

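from pyspark.sql.functions import lead, lit, when

w = Window.partitionBy("Id").orderBy("strt_dts")
# Take the next row's strt_dts when its hash differs from the current one.
# On the last row of a partition lead() is null, so the default is used.
cond = when(lead(md5DF['md5Val']).over(w) != md5DF['md5Val'], lead(md5DF['strt_dts']).over(w))\
    .otherwise(lit('9999-12-31 00:00:00.000000'))
md5DF.withColumn('end_dts', cond).show()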
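
For checking the rule itself without a Spark session, here is a minimal plain-Python sketch of what the update computes, under a few assumptions: the rows are the five sample rows from the tables above, strt_dts is treated as a string (so it sorts lexicographically), and `row_hash` and `add_end_dts` are hypothetical helper names, not part of PySpark. The hash only needs to be consistent within the sketch; it is not claimed to reproduce the md5Val column exactly.

```python
import hashlib
from itertools import groupby
from operator import itemgetter

HIGH_DATE = "9999-12-31 00:00:00.000000"

# Sample rows copied from the question: (Id, strt_dts, instnc_nm, strm_nb)
rows = [
    ("27608171", "2018-07-17 04:00:00", "Standard", "11B"),
    ("27608174", "2018-08-17 04:00:00", "Standard", "11C"),
    ("27608173", "2018-09-17 04:00:00", "Standard", "11D"),
    ("27608171", "2018-09-17 04:00:00", "Standard", "11D"),
    ("27608174", "2018-09-17 04:00:00", "Standard", "11D"),
]


def row_hash(instnc_nm, strm_nb):
    """Hypothetical helper: MD5 hex digest of the comma-joined values."""
    return hashlib.md5(f"{instnc_nm},{strm_nb}".encode("utf-8")).hexdigest()


def add_end_dts(rows):
    """Emulate lead() over a window partitioned by Id and ordered by strt_dts."""
    out = []
    # Sorting by (Id, strt_dts) groups each partition and orders it by start time.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        for i, (id_, strt, nm, nb) in enumerate(group):
            nxt = group[i + 1] if i + 1 < len(group) else None
            # Use the next row's strt_dts only when its hash differs;
            # the last row of a partition falls back to the high date.
            if nxt is not None and row_hash(nxt[2], nxt[3]) != row_hash(nm, nb):
                end = nxt[1]
            else:
                end = HIGH_DATE
            out.append((id_, strt, nm, nb, row_hash(nm, nb), end))
    return out


result = add_end_dts(rows)
for r in result:
    print(r)
```

On the sample rows this yields the same end_dts values as the expected table: the first row of Id 27608171 and of Id 27608174 gets 2018-09-17 04:00:00, and every last row in a partition gets the high date.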

0 answers:

No answers