Partitioning function in Spark Scala

Asked: 2018-07-05 00:54:09

Tags: scala apache-spark apache-spark-sql

DF:

ID  col1 ... coln  Date
1                  1991-01-11 11:03:46.0
1                  1991-01-11 11:03:46.0
1                  1991-02-22 12:05:58.0
1                  1991-02-22 12:05:58.0
1                  1991-02-22 12:05:58.0

I am creating a new column "identify" to mark the (ID, Date) partitions, so that I can then pick the top combination ordered by "identify".

Expected DF:

ID  col1 ... coln  Date                   identify
1                  1991-01-11 11:03:46.0  1
1                  1991-01-11 11:03:46.0  1
1                  1991-02-22 12:05:58.0  2
1                  1991-02-22 12:05:58.0  2
1                  1991-02-22 12:05:58.0  2

Code attempt 1:

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID  col1 ... coln  Date                   identify
1                  1991-01-11 11:03:46.0  1
1                  1991-01-11 11:03:46.0  2
1                  1991-02-22 12:05:58.0  3
1                  1991-02-22 12:05:58.0  4
1                  1991-02-22 12:05:58.0  5

Code attempt 2:

 var window = Window.partitionBy("ID","DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))

My output:

ID  col1 ... coln  Date                   identify
1                  1991-01-11 11:03:46.0  1
1                  1991-01-11 11:03:46.0  2
1                  1991-02-22 12:05:58.0  1
1                  1991-02-22 12:05:58.0  2
1                  1991-02-22 12:05:58.0  3

Any suggestion on how to tweak the code to get the desired output would be helpful.

1 Answer:

Answer 0 (score: 0):

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))