Environment: Spark 1.6, Scala
My DataFrame looks like this:
DF =
DT         | col1 | col2
---------- | ---- | ----
2017011011 | AA | BB
2017011011 | CC | DD
2017011015 | PP | BB
2017011015 | QQ | DD
2017011016 | AA | BB
2017011016 | CC | DD
2017011017 | PP | BB
2017011017 | QQ | DD
How can I filter to get a result like this SQL: select * from DF where DT in (select distinct DT from DF order by DT desc limit 3)?
The output should contain the rows for the last 3 dates:
2017011015 | PP | BB
2017011015 | QQ | DD
2017011016 | AA | BB
2017011016 | CC | DD
2017011017 | PP | BB
2017011017 | QQ | DD
Thanks,
Hussein
Answer 0 (score: 0)
Tested on Spark 1.6.1:
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq(
  (2017011011, "AA", "BB"),
  (2017011011, "CC", "DD"),
  (2017011015, "PP", "BB"),
  (2017011015, "QQ", "DD"),
  (2017011016, "AA", "BB"),
  (2017011016, "CC", "DD"),
  (2017011017, "PP", "BB"),
  (2017011017, "QQ", "DD")
)).select(
  $"_1".as("DT"),
  $"_2".as("col1"),
  $"_3".as("col2")
)

// Collect the 3 most recent distinct dates to the driver
val dates = df.select($"DT")
  .distinct()
  .orderBy($"DT".desc)
  .map(_.getInt(0))
  .take(3)

// Keep only rows whose DT is one of those 3 dates
val result = df.filter($"DT".isin(dates: _*))
result.show()
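The same selection logic can be sanity-checked in plain Scala without a Spark cluster. This is just a sketch of the idea (distinct dates, sort descending, take 3, filter by membership), assuming DT is an Int in the first tuple field; the names `rows`, `last3`, and `kept` are illustrative only.

```scala
// Sample rows mirroring the DataFrame above: (DT, col1, col2)
val rows = Seq(
  (2017011011, "AA", "BB"), (2017011011, "CC", "DD"),
  (2017011015, "PP", "BB"), (2017011015, "QQ", "DD"),
  (2017011016, "AA", "BB"), (2017011016, "CC", "DD"),
  (2017011017, "PP", "BB"), (2017011017, "QQ", "DD")
)

// The 3 most recent distinct dates
val last3 = rows.map(_._1).distinct.sorted(Ordering[Int].reverse).take(3).toSet

// Keep only rows whose DT is among those dates
val kept = rows.filter(r => last3.contains(r._1))
```

Here `kept` holds the 6 rows for 2017011015, 2017011016, and 2017011017, matching the expected output above.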