DataFrame filter problem: what should I do?

Asked: 2017-01-04 04:52:18

Tags: scala apache-spark filter spark-dataframe

Environment: Spark 1.6, Scala

My DataFrame looks like the following:

DF =
DT         | col1 | col2
---------- | ---- | ----
2017011011 | AA | BB
2017011011 | CC | DD
2017011015 | PP | BB
2017011015 | QQ | DD
2017011016 | AA | BB
2017011016 | CC | DD
2017011017 | PP | BB
2017011017 | QQ | DD

How can I filter to get a result like the SQL: select * from DF where dt in (select distinct dt from DF order by dt desc limit 3), i.e. keep only the rows whose dt is one of the three most recent distinct dates?

The output should contain the last 3 dates:

2017011015 | PP | BB
2017011015 | QQ | DD
2017011016 | AA | BB
2017011016 | CC | DD
2017011017 | PP | BB
2017011017 | QQ | DD

Thanks,
Hossain

1 answer:

Answer 0 (score: 0)

Tested on Spark 1.6.1:

import sqlContext.implicits._

// Build the sample DataFrame and give the tuple columns meaningful names
val df = sqlContext.createDataFrame(Seq(
  (2017011011, "AA", "BB"),
  (2017011011, "CC", "DD"),
  (2017011015, "PP", "BB"),
  (2017011015, "QQ", "DD"),
  (2017011016, "AA", "BB"),
  (2017011016, "CC", "DD"),
  (2017011017, "PP", "BB"),
  (2017011017, "QQ", "DD")
)).select(
  $"_1".as("DT"),
  $"_2".as("col1"),
  $"_3".as("col2")
)

val dates = df.select($"DT")
  .distinct()
  .orderBy(-$"DT")
  .map(_.getInt(0))
  .take(3)

// Keep only the rows whose DT equals one of the collected dates
val result = df.filter(dates.map($"DT" === _).reduce(_ || _))
result.show()
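
If you would rather not collect the dates to the driver, a join-based variant should also work on Spark 1.6. This is only a sketch: it assumes the df built above, and the names top3 and result2 are purely illustrative.

// Alternative sketch: keep the three most recent distinct dates as a DataFrame
val top3 = df.select($"DT").distinct().orderBy($"DT".desc).limit(3)
// Inner join on DT keeps only the rows whose date is among the top 3
val result2 = df.join(top3, "DT")
result2.show()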