How to get the last rows after ordering

Time: 2017-09-27 08:06:28

Tags: scala apache-spark

I have a DataFrame like the following:

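import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import ss.implicits._ // ss is the SparkSession; needed for toDF and $"..."

val df = ss.sparkContext.parallelize( Seq (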
  ("c1", "2017-1-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-11-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-5-1 00:00:00", 12, "B", "A,B"),
  ("c1", "2017-7-1 00:00:00", 13, "B", "A,B"),
  ("c2", "2017-3-1 00:00:00", 10, "B", "A,B"),
  ("c2", "2017-8-1 00:00:00", 11, "C", "A,B"),
  ("c2", "2017-5-1 00:00:00", 20, "C", "A,B"),
  ("c2", "2017-1-1 00:00:00", 18, "A", "A,B"),
  ("c2", "2017-9-1 00:00:00", 17, "A", "A,B")
)).toDF("city", "month", "sales", "area", "arealist")

// Parse strings such as "2017-1-1 00:00:00" into java.sql.Timestamp
val strToDate = udf( (str : String) => {
  Timestamp.valueOf(str)
})
df.withColumn("month", strToDate($"month")).orderBy("city","month").show

The output is:

+----+-------------------+-----+----+--------+
|city|              month|sales|area|arealist|
+----+-------------------+-----+----+--------+
|  c1|2017-01-01 00:00:00|   10|   A|     A,B|
|  c1|2017-05-01 00:00:00|   12|   B|     A,B|
|  c1|2017-07-01 00:00:00|   13|   B|     A,B|
|  c1|2017-11-01 00:00:00|   10|   A|     A,B|
|  c2|2017-01-01 00:00:00|   18|   A|     A,B|
|  c2|2017-03-01 00:00:00|   10|   B|     A,B|
|  c2|2017-05-01 00:00:00|   20|   C|     A,B|
|  c2|2017-08-01 00:00:00|   11|   C|     A,B|
|  c2|2017-09-01 00:00:00|   17|   A|     A,B|
+----+-------------------+-----+----+--------+

I want to get the last two rows for each city. The result should look like this:

+----+-------------------+-----+----+--------+
|  c1|2017-07-01 00:00:00|   13|   B|     A,B|
|  c1|2017-11-01 00:00:00|   10|   A|     A,B|
|  c2|2017-08-01 00:00:00|   11|   C|     A,B|
|  c2|2017-09-01 00:00:00|   17|   A|     A,B|
+----+-------------------+-----+----+--------+

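How can I do this?

1 Answer:

Answer 0: (score: 1)

You can use a Window function to rank the rows within each city by month, newest first, and then keep only the rows whose rank is at most 2, i.e. the two most recent months per city.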

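import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank,col}

// Assumes df("month") has already been converted to a date or timestamp;
// as raw strings, "2017-11-1" sorts before "2017-5-1".
val window = Window.partitionBy(df("city")).orderBy(df("month").desc)

df.withColumn("rank", rank().over(window)) // rank 1 = latest month in the city
  .filter(col("rank") <= 2)
  .drop("rank")
  .show()

Output:
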
+----+----------+-----+----+--------+
|city|     month|sales|area|arealist|
+----+----------+-----+----+--------+
|  c1|2017-11-01|   10|   A|     A,B|
|  c1|2017-07-01|   13|   B|     A,B|
|  c2|2017-09-01|   17|   A|     A,B|
|  c2|2017-08-01|   11|   C|     A,B|
+----+----------+-----+----+--------+
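
Two details of the snippet above may matter in practice. `rank()` gives tied months the same rank, so a city could return more than two rows, and the result comes back newest first, whereas the question lists the rows oldest first. The following is a minimal sketch, not the answerer's code, that swaps in `row_number()` and re-sorts the result. The object name `LastRowsPerCity`, the helper `lastRows`, and the local `SparkSession` settings are assumptions made for illustration.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

object LastRowsPerCity {

  // Hypothetical helper: keeps the n latest rows of each city.
  // Expects a "city" column and a "month" column of date/timestamp type.
  def lastRows(df: DataFrame, n: Int): DataFrame = {
    val latestFirst = Window.partitionBy(col("city")).orderBy(col("month").desc)

    df.withColumn("rn", row_number().over(latestFirst)) // 1 = latest row in the city
      .filter(col("rn") <= n)                           // at most n rows per city, even with ties
      .drop("rn")
      .orderBy("city", "month")                         // oldest first, as in the question
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for trying the sketch out
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("last-rows-per-city")
      .getOrCreate()
    import spark.implicits._

    val raw = Seq(
      ("c1", "2017-1-1 00:00:00", 10, "A", "A,B"),
      ("c1", "2017-11-1 00:00:00", 10, "A", "A,B"),
      ("c1", "2017-5-1 00:00:00", 12, "B", "A,B"),
      ("c1", "2017-7-1 00:00:00", 13, "B", "A,B"),
      ("c2", "2017-3-1 00:00:00", 10, "B", "A,B"),
      ("c2", "2017-8-1 00:00:00", 11, "C", "A,B"),
      ("c2", "2017-5-1 00:00:00", 20, "C", "A,B"),
      ("c2", "2017-1-1 00:00:00", 18, "A", "A,B"),
      ("c2", "2017-9-1 00:00:00", 17, "A", "A,B")
    ).toDF("city", "month", "sales", "area", "arealist")

    // Same string-to-timestamp conversion as in the question
    val strToTimestamp = udf((s: String) => Timestamp.valueOf(s))
    val df = raw.withColumn("month", strToTimestamp(col("month")))

    lastRows(df, 2).show()
    spark.stop()
  }
}
```

When two rows of a city share the same month, `row_number()` picks between them arbitrarily. If that choice has to be deterministic, adding a second column to the window's `orderBy` would be one way to break the tie.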