Spark SQL ranking problem

Time: 2018-06-20 16:26:12

Tags: apache-spark-sql

I am trying to assign the same rank to rows whose audit_timestamp is the same.

I am using Spark 1.5 on CDH 5.5.

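import com.databricks.spark.avro._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, md5, rank}

val loc = "/data/published/logs/a.avro"
val df = sqlContext.read.avro(loc)
// Build a natural key by hashing service_key + event_start_date_time
val concatDF = df.withColumn("concat_log", concat(df("service_key"), df("event_start_date_time")))
val md5DF = concatDF.withColumn("nk_log", md5(concatDF("concat_log"))).drop("concat_log")
// Rank rows within each nk_log, newest audit_timestamp first
val windowFunction = Window.partitionBy(md5DF("nk_log")).orderBy(md5DF("audit_timestamp").desc)
val rankDF = md5DF.withColumn("logs_number", rank().over(windowFunction))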


rankDF.filter(rankDF("event_start_date_time") === "2018-06-05T15:00:00Z").select("nk_log","event_start_date_time","service_key","event_sequence","audit_timestamp","logs_number").show(100,false)


 Actual output:

 +--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
 |nk_log                          |event_start_date_time|service_key|event_sequence|audit_timestamp    |logs_number|
 +--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |371           |2018-06-10 10:05:38|1          |
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |362           |2018-06-10 10:05:38|1          |
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |386           |2018-06-08 10:05:37|1          |

 Expected output:

 +--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
 |nk_log                          |event_start_date_time|service_key|event_sequence|audit_timestamp    |logs_number|
 +--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |371           |2018-06-10 10:05:38|1          |
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |362           |2018-06-10 10:05:38|1          |
 |00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |386           |2018-06-08 10:05:37|2          |

I don't know what the problem is here. How can I get the expected output?
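For reference, the expected numbering (1, 1, 2) matches what SQL calls a dense rank, whereas a plain rank normally leaves a gap after ties (1, 1, 3). The sketch below is plain Scala with no Spark dependency; `RankDemo`, `rank`, and `denseRank` are hypothetical helper names written for this illustration, not Spark's implementation. It only shows the difference in semantics on the three audit_timestamp values from the tables above, and it does not by itself explain why the actual output shows 1 on the third row.

```scala
// Plain-Scala illustration (no Spark needed) of rank vs. dense rank.
// Both helpers assume the input is already sorted in the window's order.
object RankDemo {
  // Competition ranking: ties share a rank, and a gap follows them (1, 1, 3).
  // The first index of a value equals the number of rows strictly before it.
  def rank[A](sorted: Seq[A]): Seq[Int] =
    sorted.map(v => sorted.indexOf(v) + 1)

  // Dense ranking: ties share a rank, and no gap follows them (1, 1, 2).
  def denseRank[A](sorted: Seq[A]): Seq[Int] = {
    val distinctValues = sorted.distinct // keeps first-seen order
    sorted.map(v => distinctValues.indexOf(v) + 1)
  }

  def main(args: Array[String]): Unit = {
    // audit_timestamp values from the question, newest first
    val auditTimestamps = Seq(
      "2018-06-10 10:05:38",
      "2018-06-10 10:05:38",
      "2018-06-08 10:05:37"
    )
    println(rank(auditTimestamps))      // ranks with a gap after the tie
    println(denseRank(auditTimestamps)) // ranks without a gap
  }
}
```

In Spark 1.5 the corresponding column function in `org.apache.spark.sql.functions` is `denseRank()` (renamed `dense_rank()` in 1.6), so swapping it in for `rank()` in the `withColumn("logs_number", ...)` call would be the first thing to try. Whether that alone yields the expected table on this CDH build is an assumption that has not been verified here.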

0 answers:

No answers yet