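I am using the lead window function together with the md5 function, which I apply to the values of several columns.
Basically, the strt_dts value of the next row should be placed in end_dts.
Code: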
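from pyspark.sql import Window
from pyspark.sql.functions import concat_ws, lead, md5

# Hash the two descriptive columns into a single md5 value
md5DF = df.withColumn("md5Val", md5(concat_ws(",", "instnc_nm", "strm_nb")))
md5DF.show(20, False)
# lead() is applied to md5Val here, so end_dts receives the next row's hash
md5DF.withColumn('end_dts',
    lead(md5DF['md5Val'], default='9999-12-31 00:00:00.000000')
    .over(Window.partitionBy("Id").orderBy("strt_dts")))\
    .show(20, False)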
Output:
+--------+-------------------+---------+------------+--------------------------------+
|Id |strt_dts |instnc_nm|strm_nb |md5Val |
+--------+-------------------+---------+------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B |bc3cdb48b849c565c241e2c0f8b7d156|
|27608174|2018-08-17 04:00:00|Standard |11C |da13b49654960a0de488dae1b4e7e7d3|
|27608173|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|
|27608171|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|
|27608174|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|
+--------+-------------------+---------+------------+--------------------------------+
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|Id |strt_dts |instnc_nm |strm_nb |md5Val |end_dts |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B |bc3cdb48b849c565c241e2c0f8b7d156|de580c5dc7ffa27324a2a8845d1347dc|
|27608171|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
|27608174|2018-08-17 04:00:00|Standard |11C |da13b49654960a0de488dae1b4e7e7d3|de580c5dc7ffa27324a2a8845d1347dc|
|27608174|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
|27608173|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
Expected:
Wherever end_dts currently contains an md5 hash code, it should instead contain the strt_dts value of the row that hash came from.
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|Id |strt_dts |instnc_nm |strm_nb |md5Val |end_dts |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
|27608171|2018-07-17 04:00:00|Standard |11B |bc3cdb48b849c565c241e2c0f8b7d156|2018-09-17 04:00:00|
|27608171|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
|27608174|2018-08-17 04:00:00|Standard |11C |da13b49654960a0de488dae1b4e7e7d3|2018-09-17 04:00:00|
|27608174|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
|27608173|2018-09-17 04:00:00|Standard |11D |de580c5dc7ffa27324a2a8845d1347dc|9999-12-31 00:00:00.000000 |
+--------+-------------------+---------+------------+--------------------------------+--------------------------------+
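Put differently, the expected end_dts is "the next strt_dts within the same Id, or the high date when there is no next row". Here is a minimal plain-Python sketch of that rule, with no Spark involved. The names with_end_dts and HIGH_DATE are made up for this illustration, and the timestamps are assumed to be strings in a format that sorts chronologically:

```python
from itertools import groupby
from operator import itemgetter

# Placeholder end date for the last row of each Id (value taken from the question)
HIGH_DATE = "9999-12-31 00:00:00.000000"

rows = [
    {"Id": 27608171, "strt_dts": "2018-07-17 04:00:00"},
    {"Id": 27608174, "strt_dts": "2018-08-17 04:00:00"},
    {"Id": 27608173, "strt_dts": "2018-09-17 04:00:00"},
    {"Id": 27608171, "strt_dts": "2018-09-17 04:00:00"},
    {"Id": 27608174, "strt_dts": "2018-09-17 04:00:00"},
]


def with_end_dts(rows, default=HIGH_DATE):
    """Hypothetical helper: mimic lead(strt_dts) per Id, ordered by strt_dts."""
    ordered = sorted(rows, key=itemgetter("Id", "strt_dts"))
    out = []
    for _, group in groupby(ordered, key=itemgetter("Id")):
        group = list(group)
        for i, row in enumerate(group):
            # Next row's strt_dts in the same partition, or the default at the end
            nxt = group[i + 1]["strt_dts"] if i + 1 < len(group) else default
            out.append({**row, "end_dts": nxt})
    return out


result = with_end_dts(rows)
for row in result:
    print(row["Id"], row["strt_dts"], row["end_dts"])
```

The md5 value plays no part in this rule, which suggests the window function should read strt_dts rather than md5Val.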
Update:
I got the answer with the code below, but I feel this is the worst way to do it. Is there a more efficient way?
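from pyspark.sql import Window
from pyspark.sql.functions import lead, lit, when

w = Window.partitionBy("Id").orderBy("strt_dts")
# Take the next row's strt_dts when its hash differs; otherwise use the high date
cond = when(lead(md5DF['md5Val']).over(w) != md5DF['md5Val'], lead(md5DF['strt_dts']).over(w))\
    .otherwise(lit('9999-12-31 00:00:00.000000'))
md5DF.withColumn('end_dts', cond).show()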