I have an Apache Spark DataFrame containing the following data (ID, Name, DATE):
ID,Name,DATE
1,Anil,2000-06-02
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-05
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-08
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-11
4,Ram,2000-06-02
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-05
5,Ramu,2000-06-06
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11
But I only want the two most recent records per ID, i.e. the following output:
ID,Name,DATE
1,Anil,2000-06-03
1,Anil,2000-06-04
2,Arun,2000-06-06
2,Arun,2000-06-07
3,Anju,2000-06-09
3,Anju,2000-06-10
4,Ram,2000-06-03
4,Ram,2000-06-04
5,Ramu,2000-06-07
5,Ramu,2000-06-08
6,Renu,2000-06-09
7,Gopu,2000-06-10
7,Gopu,2000-06-11
Do I need to use a window function such as LAG?
Answer 0 (score: 5)
You can use a LEFT OUTER JOIN with a COUNT:

SELECT d.ID, d.Name, d.Date
FROM Dataframetable d
LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
GROUP BY d.ID, d.Name, d.Date
HAVING COUNT(*) < 2
Output:
ID Name Date
1 Anil 2000-06-03T00:00:00Z
1 Anil 2000-06-04T00:00:00Z
2 Arun 2000-06-06T00:00:00Z
2 Arun 2000-06-07T00:00:00Z
3 Anju 2000-06-09T00:00:00Z
3 Anju 2000-06-10T00:00:00Z
4 Ram 2000-06-04T00:00:00Z
4 Ram 2000-06-11T00:00:00Z
5 Ramu 2000-06-07T00:00:00Z
5 Ramu 2000-06-08T00:00:00Z
6 Renu 2000-06-09T00:00:00Z
7 Gopu 2000-06-10T00:00:00Z
7 Gopu 2000-06-11T00:00:00Z
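The join-and-count trick is plain SQL, so it can be sanity-checked outside Spark. The sketch below is a minimal stand-in using Python's built-in sqlite3 (the `Dataframetable` name follows the answer; the rows are the sample data from the question):

```python
import sqlite3

# Sample data from the question: (ID, Name, DATE)
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Dataframetable (ID INT, Name TEXT, Date TEXT)")
con.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Each row of d is paired with every strictly later row d2 of the same ID.
# The latest row matches nothing (one NULL-extended row, COUNT 1), the
# second-latest matches one row (COUNT 1), the third-latest matches two
# (COUNT 2), so HAVING COUNT(*) < 2 keeps exactly the latest two per ID.
result = con.execute("""
    SELECT d.ID, d.Name, d.Date
    FROM Dataframetable d
    LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date
    GROUP BY d.ID, d.Name, d.Date
    HAVING COUNT(*) < 2
    ORDER BY d.ID, d.Date
""").fetchall()

for r in result:
    print(r)
```

This reproduces the answer's output table (13 rows); note that for ID 4 the two latest dates really are 2000-06-04 and 2000-06-11, as the answer shows.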
Or, using a subquery instead of a self join:

SELECT ID, Name, Date FROM (SELECT d.ID, d.Name, MAX(d.Date) Date
FROM Dataframetable d
GROUP BY d.ID, d.Name
UNION ALL
SELECT d.ID, d.Name, MAX(d.Date)
FROM Dataframetable d
WHERE d.Date NOT IN
(SELECT MAX(d2.Date) FROM Dataframetable d2 WHERE d2.ID = d.ID)
GROUP BY d.ID, d.Name) b
ORDER BY ID

Note that the NOT IN subquery must be correlated on ID: an uncorrelated list of every group's maximum would filter a date out of one ID's group merely because it happens to be another ID's latest date.

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/1/0
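The subquery variant can be checked with the same sqlite3 stand-in. In this sketch the NOT IN subquery is correlated on ID, so that one ID's maximum date cannot knock a row out of another ID's group (e.g. 2000-06-04 is Anil's latest date but only Ram's second-latest):

```python
import sqlite3

# Sample data from the question: (ID, Name, DATE)
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Dataframetable (ID INT, Name TEXT, Date TEXT)")
con.executemany("INSERT INTO Dataframetable VALUES (?, ?, ?)", rows)

# Branch 1: the latest date per (ID, Name).
# Branch 2: the latest remaining date per (ID, Name), after excluding each
# ID's own maximum via a NOT IN that is correlated on ID. IDs with a single
# row (like 6, Renu) produce no row in branch 2 and so appear only once.
union_rows = con.execute("""
    SELECT ID, Name, Date FROM (
        SELECT d.ID, d.Name, MAX(d.Date) AS Date
        FROM Dataframetable d
        GROUP BY d.ID, d.Name
        UNION ALL
        SELECT d.ID, d.Name, MAX(d.Date)
        FROM Dataframetable d
        WHERE d.Date NOT IN
              (SELECT MAX(d2.Date) FROM Dataframetable d2 WHERE d2.ID = d.ID)
        GROUP BY d.ID, d.Name
    ) ORDER BY ID, Date
""").fetchall()

for r in union_rows:
    print(r)
```

The result matches the join-and-count version: 13 rows, two per ID except for ID 6.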
Answer 1 (score: 3)
Thanks @Matt - your solution works correctly; I tested it with Apache Spark:
val sparkConf = new SparkConf().setAppName("DFTest").setMaster("local[5]")
val sc = new SparkContext(sparkConf)
val hadoopConf = sc.hadoopConfiguration
val sqlContext = new SQLContext(sc)
val myFile = sc.textFile("C:\\DFTest\\DFTest.txt")
case class Record(id: Int, name: String, datetime : String)
val myFile1 = myFile.map(x => x.split(",")).map {
  case Array(id, name, datetime) => Record(id.toInt, name, datetime)
}
import sqlContext.implicits._
val myDF = myFile1.toDF()
myDF.registerTempTable("deep_cust")
sqlContext.sql("SELECT d.id, d.name, d.datetime FROM deep_cust d " +
"LEFT OUTER JOIN deep_cust d2 ON d2.id = d.id AND d.datetime < d2.datetime " +
"GROUP BY d.id, d.name, d.datetime " +
"HAVING COUNT(*) < 2").show()
But it won't work directly with Hive, because Hive does not support non-equi join conditions, so we have to use an alternative such as RANK.
Alternative approach:

@Matt, could you advise whether the RANK solution below is faster than the join? If not, we would have to move the AND d.Date < d2.Date condition into a WHERE clause.
select x.id, x.name, x.datetime
from (select id, name, datetime,
             rank() over (partition by id, name order by datetime desc) as rownum
      from deep_cust) x
where x.rownum < 3;
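The RANK query can also be exercised outside Spark: sqlite3 supports window functions from SQLite 3.25 on, so the same data works in a small Python sketch (the `deep_cust` table and column names follow the Scala snippet above). One caveat worth noting: rank() assigns equal ranks to tied datetimes, so a tie on the second-latest date could return more than two rows per ID; row_number() would guarantee exactly two.

```python
import sqlite3

# Sample data from the question: (id, name, datetime)
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE deep_cust (id INT, name TEXT, datetime TEXT)")
con.executemany("INSERT INTO deep_cust VALUES (?, ?, ?)", rows)

# rank() numbers each id's rows from newest to oldest; rownum < 3 keeps the
# two most recent rows per id (requires SQLite 3.25+ for window functions).
top2 = con.execute("""
    SELECT x.id, x.name, x.datetime
    FROM (SELECT id, name, datetime,
                 rank() OVER (PARTITION BY id, name
                              ORDER BY datetime DESC) AS rownum
          FROM deep_cust) x
    WHERE x.rownum < 3
    ORDER BY x.id, x.datetime
""").fetchall()

for r in top2:
    print(r)
```

With this data (no duplicate dates within an ID) the output is the same 13 rows as the join-based solution.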