I have a dataset containing employee name, date, and balance, with a separate rank computed for each employee.
df.show();
+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|          A |2016-02-05|   2143|   1|
|          A |2016-07-05|    231|   2|
|          A |2016-08-05|    447|   3|
|          A |2017-10-05|    779|   4|
|          A |2018-03-05|    255|   5|
|          A |2018-05-05|    246|   6|
|          A |2018-08-05|    378|   7|
|          A |2018-11-05|  10635|   8|
|          A |2019-06-05|     49|   9|
|          A |2020-02-05|      0|  10|
|          A |2020-04-05|    244|  11|
|          A |2020-05-05|      0|  12|
|          A |2020-09-05|    424|  13|
|          C |2016-05-05|   1506|   1|
|          C |2017-06-05|     52|   2|
|          C |2017-09-05|    723|   3|
|          C |2017-11-05|     23|   4|
+------------+----------+-------+----+
I need to split this dataset into one table per employee, each keeping its own ranking. So my expected output is:
table1
+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|          A |2016-02-05|   2143|   1|
|          A |2016-07-05|    231|   2|
|          A |2016-08-05|    447|   3|
|          A |2017-10-05|    779|   4|
|          A |2018-03-05|    255|   5|
|          A |2018-05-05|    246|   6|
|          A |2018-08-05|    378|   7|
|          A |2018-11-05|  10635|   8|
|          A |2019-06-05|     49|   9|
|          A |2020-02-05|      0|  10|
|          A |2020-04-05|    244|  11|
|          A |2020-05-05|      0|  12|
|          A |2020-09-05|    424|  13|
+------------+----------+-------+----+
table2
+------------+----------+-------+----+
|    Employee|      date|balance|rank|
+------------+----------+-------+----+
|          C |2016-05-05|   1506|   1|
|          C |2017-06-05|     52|   2|
|          C |2017-09-05|    723|   3|
|          C |2017-11-05|     23|   4|
+------------+----------+-------+----+
I used a window function to compute this rank, but I can't figure out how to get separate tables like these. I am using Spark 2.0.0 with Java.
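// Column names are passed as strings; rank() comes from org.apache.spark.sql.functions
WindowSpec ws = Window.partitionBy("Employee").orderBy("date");
Dataset<Row> df = data.withColumn("rank", rank().over(ws));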
Answer 0 (score: 1)
Here is sample code that does this by filtering on each distinct value of Employee:
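import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rank;

// Collect the distinct Employee values to the driver
List<Row> distinctEmployees = df.select("Employee").distinct().collectAsList();
// Initialize an empty list for the new DataFrames
List<Dataset<Row>> newDFs = new ArrayList<>();
// Partitioning by Employee avoids pulling all rows into a single partition
WindowSpec ws = Window.partitionBy("Employee").orderBy("date");
// Filter by each distinct value and add the result to the list
for (Row distinctEmployee : distinctEmployees) {
    String employee = distinctEmployee.getString(0);
    newDFs.add(
        // equalTo is the Java-friendly form of Scala's === ($eq$eq$eq)
        df.filter(col("Employee").equalTo(employee))
          .withColumn("rank", rank().over(ws))
    );
}
// Show all the new DataFrames
for (Dataset<Row> aDF : newDFs) {
    aDF.show();
}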
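To make the intent of the loop concrete, here is a minimal plain-Java sketch (no Spark involved) that models the same "partition by Employee, order by date, rank" logic on an in-memory list. It is only an illustration of the semantics under stated assumptions: `BalanceRow`, `splitAndRank`, and `sampleRows` are hypothetical names introduced here, not part of the Spark API, and the sample rows are a small subset of the data in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "partition by Employee, order by date, rank". Not Spark code.
final class SplitByEmployee {

    // Hypothetical row type standing in for a Spark Row.
    static final class BalanceRow {
        final String employee;
        final String date;   // ISO yyyy-MM-dd, so string order matches date order
        final int balance;
        int rank;            // filled in by splitAndRank

        BalanceRow(String employee, String date, int balance) {
            this.employee = employee;
            this.date = date;
            this.balance = balance;
        }
    }

    // Groups rows by employee, sorts each group by date, and assigns ranks.
    static Map<String, List<BalanceRow>> splitAndRank(List<BalanceRow> rows) {
        Map<String, List<BalanceRow>> tables = new TreeMap<>();
        for (BalanceRow row : rows) {
            tables.computeIfAbsent(row.employee, k -> new ArrayList<>()).add(row);
        }
        for (List<BalanceRow> table : tables.values()) {
            table.sort(Comparator.comparing((BalanceRow r) -> r.date));
            for (int i = 0; i < table.size(); i++) {
                BalanceRow current = table.get(i);
                // Like rank(): equal dates share a rank, and the next distinct date skips ahead.
                boolean tie = i > 0 && current.date.equals(table.get(i - 1).date);
                current.rank = tie ? table.get(i - 1).rank : i + 1;
            }
        }
        return tables;
    }

    // A few rows from the question, deliberately out of order.
    static List<BalanceRow> sampleRows() {
        return Arrays.asList(
            new BalanceRow("A", "2016-07-05", 231),
            new BalanceRow("C", "2017-06-05", 52),
            new BalanceRow("A", "2016-02-05", 2143),
            new BalanceRow("C", "2016-05-05", 1506),
            new BalanceRow("C", "2017-09-05", 723));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<BalanceRow>> table : splitAndRank(sampleRows()).entrySet()) {
            System.out.println("Employee " + table.getKey());
            for (BalanceRow row : table.getValue()) {
                System.out.println("  " + row.date + " " + row.balance + " rank=" + row.rank);
            }
        }
    }
}
```

In the Spark version each per-employee table is a `Dataset<Row>` and the data stays distributed; the sketch only mirrors what ends up in each table, which can help when checking the filter-per-value loop against the expected output.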