DF:
ID col1 . .....coln.... Date
1 1991-01-11 11:03:46.0
1 1991-01-11 11:03:46.0
1 1991-02-22 12:05:58.0
1 1991-02-22 12:05:58.0
1 1991-02-22 12:05:58.0
I am creating a new column "identify" that numbers each (ID, DATE) group, so that I can then select the top combination in "identify" order.
Expected DF:
ID col1 . .....coln.... Date . identify
1 1991-01-11 11:03:46.0 . 1
1 1991-01-11 11:03:46.0 . 1
1 1991-02-22 12:05:58.0 . 2
1 1991-02-22 12:05:58.0 . 2
1 1991-02-22 12:05:58.0 . 2
Code attempt 1:
var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID col1 . .....coln.... Date . identify
1 1991-01-11 11:03:46.0 . 1
1 1991-01-11 11:03:46.0 . 2
1 1991-02-22 12:05:58.0 . 3
1 1991-02-22 12:05:58.0 . 4
1 1991-02-22 12:05:58.0 . 5
Code attempt 2:
var window = Window.partitionBy("ID","DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID col1 . .....coln.... Date . identify
1 1991-01-11 11:03:46.0 . 1
1 1991-01-11 11:03:46.0 . 2
1 1991-02-22 12:05:58.0 . 1
1 1991-02-22 12:05:58.0 . 2
1 1991-02-22 12:05:58.0 . 3
Any suggestions on how to adjust the code to get the desired output would be helpful.
Answer 0 (score: 0)
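Use dense_rank() instead of row_number(): within each ID, rows that share the same DATE get the same number, and the numbering has no gaps.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
val window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))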
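
To compare the two window functions side by side, here is a minimal self-contained sketch. It assumes a local SparkSession and rebuilds only the ID and DATE columns from the question (col1 to coln are left out). The names DenseRankSketch, sample, and addIdentify are made up for illustration and are not from the original post.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, row_number}

object DenseRankSketch {

  // Hypothetical sample data: only ID and DATE from the question's table.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, Timestamp.valueOf("1991-01-11 11:03:46.0")),
      (1, Timestamp.valueOf("1991-01-11 11:03:46.0")),
      (1, Timestamp.valueOf("1991-02-22 12:05:58.0")),
      (1, Timestamp.valueOf("1991-02-22 12:05:58.0")),
      (1, Timestamp.valueOf("1991-02-22 12:05:58.0"))
    ).toDF("ID", "DATE")
  }

  // Adds both numberings so they can be compared:
  //   row_number -> a distinct number per row, even when DATE ties (attempt 1)
  //   identify   -> dense_rank: tied DATEs share a number, with no gaps
  def addIdentify(df: DataFrame): DataFrame = {
    val window = Window.partitionBy("ID").orderBy("DATE")
    df.withColumn("row_number", row_number().over(window))
      .withColumn("identify", dense_rank().over(window))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("dense-rank-sketch")
      .getOrCreate()

    addIdentify(sample(spark)).orderBy("DATE", "row_number").show(false)

    spark.stop()
  }
}
```

With this data, identify comes out as 1, 1, 2, 2, 2, matching the expected DF. If "select the top combination" means keeping only the earliest DATE per ID (an interpretation, since the question does not spell it out), filtering the result with `.filter($"identify" === 1)` would do that. Adding "DATE" to partitionBy, as in attempt 2, puts each (ID, DATE) pair in its own partition, so any numbering restarts at 1 for every date.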