Getting the last value with Spark window functions

Date: 2018-05-22 12:43:49

Tags: apache-spark apache-spark-sql

Suppose I have a DataFrame like this:

val df = sc.parallelize(Seq(
    (1.0, 1, "Matt"),
    (1.0, 2, "John"),
    (1.0, 3, null.asInstanceOf[String]),
    (-1.0, 2, "Adam"),
    (-1.0, 4, "Steve"))
  ).toDF("id", "timestamp", "name")

I want to get the last non-null name for each id, ordered by timestamp. This is my window:

import org.apache.spark.sql.expressions.Window

val partitionWindow = Window.partitionBy($"id").orderBy($"timestamp".desc)

I build a deduplicated DataFrame over that window:

val filteredDF = df
  .filter($"name".isNotNull)
  .withColumn("firstName", first("name") over partitionWindow)
  .drop("timestamp", "name")
  .distinct

and join it back to the original data:

val joinedDF = df.join(filteredDF, df.col("id") === filteredDF.col("id"))
  .drop(filteredDF.col("id"))

joinedDF.show()

It works fine, but I don't like this solution. Can anyone suggest something better?

Also, can anyone tell me why last doesn't work? I tried this and the results were incorrect:

val partitionWindow = Window.partitionBy($"id").orderBy($"timestamp")

val windowDF = df.withColumn("lastName", last("name") over (partitionWindow))
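
For reference, this is the (incorrect) output it gives, reconstructed from the data above — lastName just mirrors name:

windowDF.show
// +----+---------+-----+--------+
// |  id|timestamp| name|lastName|
// +----+---------+-----+--------+
// |-1.0|        2| Adam|    Adam|
// |-1.0|        4|Steve|   Steve|
// | 1.0|        1| Matt|    Matt|
// | 1.0|        2| John|    John|
// | 1.0|        3| null|    null|
// +----+---------+-----+--------+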

1 Answer:

Answer 0 (score: 3):

If you want to propagate the last known value (note this is not the same logic as your join-based approach), you should:

  • ORDER BY timestamp
  • take last ignoring nulls

val partitionWindow = Window.partitionBy($"id").orderBy($"timestamp")

df.withColumn("lastName", last("name", true) over (partitionWindow)).show
// +----+---------+-----+--------+
// |  id|timestamp| name|lastName|
// +----+---------+-----+--------+
// |-1.0|        2| Adam|    Adam|
// |-1.0|        4|Steve|   Steve|
// | 1.0|        1| Matt|    Matt|
// | 1.0|        2| John|    John|
// | 1.0|        3| null|    John|
// +----+---------+-----+--------+
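
The same propagation can be written in raw Spark SQL — a sketch assuming a SparkSession named spark and df registered under the hypothetical view name "people"; the default frame (UNBOUNDED PRECEDING to CURRENT ROW) gives the identical running result:

// Register the DataFrame so it can be queried with SQL.
df.createOrReplaceTempView("people")

// last(expr, true) ignores nulls, just like last("name", true) above.
spark.sql("""
  SELECT *,
         last(name, true) OVER (PARTITION BY id ORDER BY timestamp) AS lastName
  FROM people
""").show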

If you want to take the last value globally (the default frame only runs from the start of the partition to the current row, so an explicit unbounded frame is needed):

  • ORDER BY timestamp
  • set an unbounded frame
  • take last ignoring nulls

val partitionWindow = Window.partitionBy($"id").orderBy($"timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("lastName", last("name", true) over (partitionWindow)).show
// +----+---------+-----+--------+
// |  id|timestamp| name|lastName|
// +----+---------+-----+--------+
// |-1.0|        2| Adam|   Steve|
// |-1.0|        4|Steve|   Steve|
// | 1.0|        1| Matt|    John|
// | 1.0|        2| John|    John|
// | 1.0|        3| null|    John|
// +----+---------+-----+--------+
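
If the end goal is a single row per id rather than a value on every row, the unbounded-frame result above can simply be deduplicated afterwards — a sketch:

// Every row in a partition carries the same lastName under the unbounded
// frame, so selecting just (id, lastName) and deduplicating yields one row
// per id.
df.withColumn("lastName", last("name", true) over partitionWindow)
  .select("id", "lastName")
  .distinct
  .show
// +----+--------+
// |  id|lastName|
// +----+--------+
// |-1.0|   Steve|
// | 1.0|    John|
// +----+--------+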