I want to partition without imposing any particular ordering, so that the data keeps its natural order from the DataFrame. Any suggestions are appreciated, thanks.
Consider the following data in a Spark DataFrame:
raw data

name | item id | action
-----|---------|-------
John | 120     | sell
John | 320     | buy
Jane | 120     | sell
Jane | 450     | buy
Sam  | 360     | sell
Sam  | 300     | hold
Sam  | 450     | buy
Tim  | 470     | buy
There are a couple of rules in this table:
1. Everyone has at least one `buy` action
2. Everyone's last action must be `buy` as well
Now I want to add a sequence column, just to show the order of each person's actions:
expectation

name | item id | action | seq
-----|---------|--------|----
John | 120     | sell   | 1
John | 320     | buy    | 2
Jane | 120     | sell   | 1
Jane | 450     | buy    | 2
Sam  | 360     | sell   | 1
Sam  | 300     | hold   | 2
Sam  | 450     | buy    | 3
Tim  | 470     | buy    | 1
Here is my code:

import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
....
val df = spark.read.json(....)
val spec = Window.partitionBy($"name").orderBy(lit(1)) // <-- don't know what to use for orderBy
val dfWithSeq = df.withColumn("seq", row_number().over(spec)) // <-- please show me the magic
Interestingly, the result returned from dfWithSeq shows each person's actions in a random sequence, so with seq the actions no longer follow the order given in the original table. But I cannot find a solution.
actual result

name | item id | action | seq
-----|---------|--------|----
John | 120     | sell   | 1
John | 320     | buy    | 2
Jane | 120     | sell   | 2   <-- this is wrong
Jane | 450     | buy    | 1   <-- this is wrong
Sam  | 360     | sell   | 1
Sam  | 300     | hold   | 2
Sam  | 450     | buy    | 3
Tim  | 470     | buy    | 1
Answer 0 (score: 1)
You need to use:
I'll leave the rest for you to work out.
Answer 1 (score: 1)
Use monotonically_increasing_id.

import org.apache.spark.sql.functions.{row_number, monotonically_increasing_id}
import org.apache.spark.sql.expressions.Window
....
val df = spark.read.json(....)
// Order the window by the order in which rows were read, captured below
val spec = Window.partitionBy($"name").orderBy($"order")
val dfWithSeq = df.withColumn("order", monotonically_increasing_id())
  .withColumn("seq", row_number().over(spec))
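As a fuller sketch of the same idea (assuming a local SparkSession and an in-memory copy of the sample data in place of the original JSON source), the helper column can be dropped once `seq` is computed. Note that `monotonically_increasing_id` assigns ids that increase in the order rows appear within each input partition, so this preserves the source order only as long as the DataFrame has not been shuffled before the id is assigned:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
import org.apache.spark.sql.expressions.Window

object SeqExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("seq-example")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the JSON source in the question
    val df = Seq(
      ("John", 120, "sell"), ("John", 320, "buy"),
      ("Jane", 120, "sell"), ("Jane", 450, "buy"),
      ("Sam", 360, "sell"), ("Sam", 300, "hold"), ("Sam", 450, "buy"),
      ("Tim", 470, "buy")
    ).toDF("name", "item id", "action")

    // Tag each row with its arrival order, then number rows per person by that tag
    val spec = Window.partitionBy($"name").orderBy($"order")
    val dfWithSeq = df
      .withColumn("order", monotonically_increasing_id())
      .withColumn("seq", row_number().over(spec))
      .drop("order") // helper column no longer needed once seq exists

    dfWithSeq.show()
    spark.stop()
  }
}
```

The ids produced by `monotonically_increasing_id` are unique and increasing but not contiguous, which is fine here: only their relative order matters to `row_number`.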