I have the following DataFrame:
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import ss.implicits._  // ss is the SparkSession; enables toDF and the $"..." syntax

val df = ss.sparkContext.parallelize(Seq(
  ("c1", "2017-1-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-11-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-5-1 00:00:00", 12, "B", "A,B"),
  ("c1", "2017-7-1 00:00:00", 13, "B", "A,B"),
  ("c2", "2017-3-1 00:00:00", 10, "B", "A,B"),
  ("c2", "2017-8-1 00:00:00", 11, "C", "A,B"),
  ("c2", "2017-5-1 00:00:00", 20, "C", "A,B"),
  ("c2", "2017-1-1 00:00:00", 18, "A", "A,B"),
  ("c2", "2017-9-1 00:00:00", 17, "A", "A,B")
)).toDF("city", "month", "sales", "area", "arealist")

// Timestamp.valueOf accepts months and days without a leading zero
val strToDate = udf((str: String) => Timestamp.valueOf(str))

df.withColumn("month", strToDate($"month")).orderBy("city", "month").show
The output is:
+----+-------------------+-----+----+--------+
|city|              month|sales|area|arealist|
+----+-------------------+-----+----+--------+
|  c1|2017-01-01 00:00:00|   10|   A|     A,B|
|  c1|2017-05-01 00:00:00|   12|   B|     A,B|
|  c1|2017-07-01 00:00:00|   13|   B|     A,B|
|  c1|2017-11-01 00:00:00|   10|   A|     A,B|
|  c2|2017-01-01 00:00:00|   18|   A|     A,B|
|  c2|2017-03-01 00:00:00|   10|   B|     A,B|
|  c2|2017-05-01 00:00:00|   20|   C|     A,B|
|  c2|2017-08-01 00:00:00|   11|   C|     A,B|
|  c2|2017-09-01 00:00:00|   17|   A|     A,B|
+----+-------------------+-----+----+--------+
I want to get the last two rows for each city, ordered by month. The result should look like this:
+----+-------------------+-----+----+--------+
|  c1|2017-07-01 00:00:00|   13|   B|     A,B|
|  c1|2017-11-01 00:00:00|   10|   A|     A,B|
|  c2|2017-08-01 00:00:00|   11|   C|     A,B|
|  c2|2017-09-01 00:00:00|   17|   A|     A,B|
+----+-------------------+-----+----+--------+
How can I do this?
Answer 0 (score: 1)
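You can use a Window function to rank the rows of each city by month, newest first, and then keep only the rows for the two most recent months.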
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, col}

// Assumes df("month") is already a date or timestamp column;
// as a plain string, "2017-7-1" would sort after "2017-11-1".
val window = Window.partitionBy(df("city")).orderBy(df("month").desc)

df.withColumn("rank", rank().over(window))
  .filter(col("rank") <= 2)
  .drop("rank")
  .show()
+----+----------+-----+----+--------+
|city|     month|sales|area|arealist|
+----+----------+-----+----+--------+
|  c1|2017-11-01|   10|   A|     A,B|
|  c1|2017-07-01|   13|   B|     A,B|
|  c2|2017-09-01|   17|   A|     A,B|
|  c2|2017-08-01|   11|   C|     A,B|
+----+----------+-----+----+--------+
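
Two details may matter depending on your data. rank() assigns the same rank to rows that tie on month, so a city with duplicate months could come back with more than two rows, and the output above is newest first while the question asks for ascending order. Below is a minimal sketch of a variant that uses row_number() and re-sorts at the end. It assumes a local SparkSession named ss as in the question, parses month into a Timestamp before building the DataFrame, and uses a helper column named rn that is purely illustrative.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local SparkSession named `ss`, mirroring the question.
val ss = SparkSession.builder()
  .master("local[*]")
  .appName("last-two-per-city")
  .getOrCreate()
import ss.implicits._

// Same sample data as the question; month is parsed up front so that
// it sorts chronologically rather than as a string.
val df = Seq(
  ("c1", "2017-1-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-11-1 00:00:00", 10, "A", "A,B"),
  ("c1", "2017-5-1 00:00:00", 12, "B", "A,B"),
  ("c1", "2017-7-1 00:00:00", 13, "B", "A,B"),
  ("c2", "2017-3-1 00:00:00", 10, "B", "A,B"),
  ("c2", "2017-8-1 00:00:00", 11, "C", "A,B"),
  ("c2", "2017-5-1 00:00:00", 20, "C", "A,B"),
  ("c2", "2017-1-1 00:00:00", 18, "A", "A,B"),
  ("c2", "2017-9-1 00:00:00", 17, "A", "A,B")
).map { case (city, month, sales, area, arealist) =>
  (city, Timestamp.valueOf(month), sales, area, arealist)
}.toDF("city", "month", "sales", "area", "arealist")

// Number the rows of each city from the newest month to the oldest.
val newestFirst = Window.partitionBy(col("city")).orderBy(col("month").desc)

val result = df
  .withColumn("rn", row_number().over(newestFirst)) // `rn` is a hypothetical helper column
  .filter(col("rn") <= 2)    // at most two rows per city, even when months tie
  .drop("rn")
  .orderBy("city", "month")  // back to ascending order, as in the desired output

result.show()
```

With row_number(), which of several tied rows is kept is not deterministic unless you add a tie-breaking column to the orderBy of the window. If you would rather keep every row that ties for the top two months, rank() as in the answer above is the better fit.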