I am trying to assign the same rank to rows whose audit_timestamp is identical.
I am using Spark 1.5 on CDH 5.5.
// Spark 1.5 window functions require a HiveContext-backed sqlContext.
import com.databricks.spark.avro._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, md5, rank}

val loc = "/data/published/logs/a.avro"
val df = sqlContext.read.avro(loc)
// Natural key: md5 of service_key concatenated with event_start_date_time.
val concatDF = df.withColumn("concat_log", concat(df("service_key"), df("event_start_date_time")))
val md5DF = concatDF.withColumn("nk_log", md5(concatDF("concat_log"))).drop("concat_log")
// Number duplicates within each natural key, newest audit_timestamp first.
val windowFunction = Window.partitionBy(md5DF("nk_log")).orderBy(md5DF("audit_timestamp").desc)
val rankDF = md5DF.withColumn("logs_number", rank().over(windowFunction))
rankDF.filter(rankDF("event_start_date_time") === "2018-06-05T15:00:00Z").select("nk_log", "event_start_date_time", "service_key", "event_sequence", "audit_timestamp", "logs_number").show(100, false)
Actual output:
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|nk_log |event_start_date_time|service_key|event_sequence|audit_timestamp |logs_number|
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |371 |2018-06-10 10:05:38|1 |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |362 |2018-06-10 10:05:38|1 |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |386 |2018-06-08 10:05:37|1 |
Expected output:
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|nk_log |event_start_date_time|service_key|event_sequence|audit_timestamp |logs_number|
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |371 |2018-06-10 10:05:38|1 |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |362 |2018-06-10 10:05:38|1 |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839 |386 |2018-06-08 10:05:37|2 |
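For reference, rank() normally gives ties the same rank but leaves a gap afterwards, so even as written the code should have printed 1, 1, 3 here rather than all 1s. A minimal self-contained check of that behaviour (toy data; names and values are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Toy frame reproducing the tie pattern above (hypothetical values).
val toy = sqlContext.createDataFrame(Seq(
  ("k1", "2018-06-10 10:05:38"),
  ("k1", "2018-06-10 10:05:38"),
  ("k1", "2018-06-08 10:05:37")
)).toDF("nk_log", "audit_timestamp")

val w = Window.partitionBy("nk_log").orderBy(toy("audit_timestamp").desc)
// rank() assigns ties the same rank but skips the next value: 1, 1, 3.
toy.withColumn("logs_number", rank().over(w)).show()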
I don't know what the problem is here. How can I get the expected output?
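If the goal is for rows with the same audit_timestamp to share a rank and for the next distinct value to get the next consecutive number (1, 1, 2), that is denseRank() semantics rather than rank()'s gapped numbering (1, 1, 3). A minimal sketch reusing the window defined above; note that on Spark 1.5 the function is called denseRank() (it was renamed dense_rank() in Spark 1.6+):

import org.apache.spark.sql.functions.denseRank

// denseRank() gives ties the same rank and never skips values: 1, 1, 2.
val denseRankDF = md5DF.withColumn("logs_number", denseRank().over(windowFunction))

If the numbering still comes out all 1s, a cheap sanity check is md5DF.printSchema() to confirm the runtime type of audit_timestamp before it is used in the ordering.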