Scala program to find the latest value

Time: 2018-08-28 23:57:00

Tags: scala apache-spark bigdata

I want to create a DataFrame based on the following Hive SQL:

WITH FILTERED_table1 AS (
  SELECT *,
         row_number() OVER (PARTITION BY key_timestamp ORDER BY datime DESC) rn
  FROM table1
)

Scala function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val table1 = Window.partitionBy($"key_timestamp").orderBy($"datime".desc)

I looked into window functions and this is what I came up with. Since I'm still very new to Scala, I'm not sure how to write this as a Scala function. How do I use a Scala function to return the DataFrame this SQL would produce? Any suggestions would be appreciated. :)

1 Answer:

Answer 0 (score: 1)

Your window spec is correct. First, load the original Hive table into a DataFrame (a dummy dataset is used here):

val df = spark.sql("""select * from table1""")

df.show
// +-------------+-------------------+
// |key_timestamp|             datime|
// +-------------+-------------------+
// |            1|2018-06-01 00:00:00|
// |            1|2018-07-01 00:00:00|
// |            2|2018-05-01 00:00:00|
// |            2|2018-07-01 00:00:00|
// |            2|2018-06-01 00:00:00|
// +-------------+-------------------+
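If you don't have the Hive table available, here is a minimal sketch of building an equivalent df in memory instead of the spark.sql load above. The column names come from the question; parsing datime with to_timestamp is an assumption about the column type:

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical stand-in for the Hive table, built from an in-memory Seq.
// The timestamp strings are parsed into proper timestamps so that
// orderBy sorts chronologically rather than lexically.
val df = Seq(
  (1, "2018-06-01 00:00:00"),
  (1, "2018-07-01 00:00:00"),
  (2, "2018-05-01 00:00:00"),
  (2, "2018-07-01 00:00:00"),
  (2, "2018-06-01 00:00:00")
).toDF("key_timestamp", "datime")
  .withColumn("datime", to_timestamp($"datime"))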

To apply the window function row_number (per the window spec) to the DataFrame, use withColumn to create a new column capturing the function's result:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy($"key_timestamp").orderBy($"datime".desc)

val resultDF = df.withColumn("rn", row_number().over(window))

resultDF.show
// +-------------+-------------------+---+
// |key_timestamp|             datime| rn|
// +-------------+-------------------+---+
// |            1|2018-07-01 00:00:00|  1|
// |            1|2018-06-01 00:00:00|  2|
// |            2|2018-07-01 00:00:00|  1|
// |            2|2018-06-01 00:00:00|  2|
// |            2|2018-05-01 00:00:00|  3|
// +-------------+-------------------+---+
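Since the title suggests the end goal is the latest value per key, you would presumably keep only the top-ranked row and then drop the helper column. A minimal sketch (the latestDF name is illustrative):

// Keep only the newest row per key_timestamp, then drop the rank column.
val latestDF = resultDF.filter($"rn" === 1).drop("rn")

latestDF.show
// +-------------+-------------------+
// |key_timestamp|             datime|
// +-------------+-------------------+
// |            1|2018-07-01 00:00:00|
// |            2|2018-07-01 00:00:00|
// +-------------+-------------------+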

To verify, run the SQL against table1; you should get the same result:

spark.sql("""
    select *, row_number() over
      (partition by key_timestamp order by datime desc) rn
    from table1
  """).show