The two latest records in a Dataframe

Time: 2017-10-16 08:58:10

Tags: sql apache-spark dataframe

I have an Apache Spark Dataframe containing the following data (ID, Name, DATE):

ID,Name,DATE
1,Anil,2000-06-02
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-05
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-08
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-11
4,Ram,2000-06-02
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-05
5,Ramu,2000-06-06
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11

But I only want the two latest records for each ID, so I want to get the following output:

ID,Name,DATE
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-04
4,Ram,2000-06-11
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11

Do I need to use a window function such as lag?

2 Answers:

Answer 0 (score: 5)

LEFT OUTER JOIN the table to itself using COUNT < 2. Each row joins to every row with the same ID and a later Date, so only the latest row (no match, one null-extended row) and the second-latest row (one match) survive the HAVING filter:

SELECT d.ID, d.Name, d.Date
FROM Dataframetable d
LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
GROUP BY d.ID, d.Name, d.Date
HAVING COUNT(*) < 2

Output:

ID  Name    Date
1   Anil    2000-06-03T00:00:00Z
1   Anil    2000-06-04T00:00:00Z
2   Arun    2000-06-06T00:00:00Z
2   Arun    2000-06-07T00:00:00Z
3   Anju    2000-06-09T00:00:00Z
3   Anju    2000-06-10T00:00:00Z
4   Ram     2000-06-04T00:00:00Z
4   Ram     2000-06-11T00:00:00Z
5   Ramu    2000-06-07T00:00:00Z
5   Ramu    2000-06-08T00:00:00Z
6   Renu    2000-06-09T00:00:00Z
7   Gopu    2000-06-10T00:00:00Z
7   Gopu    2000-06-11T00:00:00Z

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/1/0

Using a subquery instead of a self join:

SELECT ID, name, date FROM (SELECT d.ID, d.Name, MAX(d.Date) Date
FROM Dataframetable d
GROUP BY d.ID, d.Name
UNION ALL
SELECT d.ID, d.Name, MAX(d.Date)
FROM Dataframetable d
WHERE d.Date NOT IN
-- correlated by ID, so a date that happens to be the maximum for a different ID is not excluded
(SELECT MAX(d2.Date) FROM Dataframetable d2 WHERE d2.ID = d.ID)
GROUP BY d.ID, d.Name) b
ORDER BY ID

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/19/0

Answer 1 (score: 3)

Thanks @Matt - your solution works fine, tested with Apache Spark:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Define the case class at top level so toDF() can find its TypeTag
  case class Record(id: Int, name: String, datetime: String)

  val sparkConf = new SparkConf().setAppName("DFTest").setMaster("local[5]")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)

  val myFile = sc.textFile("C:\\DFTest\\DFTest.txt")

  // Parse each comma-separated line into a Record
  val myFile1 = myFile.map(_.split(",")).map {
    case Array(id, name, datetime) => Record(id.toInt, name, datetime)
  }

  import sqlContext.implicits._

  val myDF = myFile1.toDF()

  myDF.registerTempTable("deep_cust")

  sqlContext.sql("SELECT d.id, d.name, d.datetime FROM deep_cust d " +
    "LEFT OUTER JOIN deep_cust d2 ON d2.id = d.id AND d.datetime < d2.datetime " +
    "GROUP BY d.id, d.name, d.datetime " +
    "HAVING COUNT(*) < 2").show()

But it won't work directly on Hive, because Hive does not support non-equi conditions in the join's ON clause, so we would have to use an alternative such as RANK.
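One way to keep the query Hive-compatible is to leave only the equality in the ON clause and move the date comparison into WHERE. Below is a minimal sketch of that rewrite (an assumption, not tested against Hive): the join becomes an inner join and the comparison becomes inclusive, so each row also counts itself and the two latest rows per id satisfy COUNT(*) <= 2. This assumes datetimes are unique within an id.

  // Hypothetical Hive-friendly variant: equi-join in ON, non-equi filter in WHERE.
  // Each row joins to every row of the same id with an equal-or-later datetime
  // (including itself), so the two latest rows per id have COUNT(*) <= 2.
  sqlContext.sql("SELECT d.id, d.name, d.datetime FROM deep_cust d " +
    "JOIN deep_cust d2 ON d2.id = d.id " +
    "WHERE d.datetime <= d2.datetime " +
    "GROUP BY d.id, d.name, d.datetime " +
    "HAVING COUNT(*) <= 2").show()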

Alternative approach:

@Matt, could you please advise whether the RANK solution below would be faster than the join? If it is not, we would have to use a WHERE clause instead of AND d.Date < d2.Date, as sketched above.

SELECT x.id, x.name, x.datetime
FROM (SELECT id, name, datetime,
             RANK() OVER (PARTITION BY id, name ORDER BY datetime DESC) AS rownum
      FROM deep_cust) x
WHERE x.rownum < 3;
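For comparison, here is a sketch of the same top-two logic written against the myDF frame from above with the DataFrame API instead of SQL. Two assumptions: row_number() is substituted for rank() so ties on datetime cannot return more than two rows per id, and on Spark 1.x window functions require a HiveContext rather than a plain SQLContext.

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{desc, row_number}

  // Number the rows within each id, newest datetime first
  val w = Window.partitionBy("id").orderBy(desc("datetime"))

  // Keep the two newest rows per id, then drop the helper column
  val topTwo = myDF
    .withColumn("rownum", row_number().over(w))
    .where($"rownum" <= 2)
    .drop("rownum")

  topTwo.show()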