I have a DataFrame, and I want to add an index column that resets based on one of the existing columns:
--------------------
| ColA | ColB |
====================
| G1 | 10 |
--------------------
| G1 | 20 |
--------------------
| G2 | 50 |
--------------------
| G2 | 10 |
--------------------
| G2 | 70 |
--------------------
I want the result to be:
-----------------------------
| ColA | ColB | ColC |
=============================
| G1 | 10 | 1 |
-----------------------------
| G1 | 20 | 2 |
-----------------------------
| G2 | 50 | 1 | <== reset because ColA changed
-----------------------------
| G2 | 10 | 2 |
-----------------------------
| G2 | 70 | 3 |
-----------------------------
Is there something like df.withColumn("id", monotonically_increasing_id) that fits here?
Answer 0 (score: 2)
Use a Window to create a partition over the ColA column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("ColA").orderBy("ColB")
df.withColumn("id", row_number.over(w))
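In plain terms, row_number over this window assigns 1, 2, 3, ... within each ColA group, ordered by ColB. A minimal pure-Python sketch of that semantics (no Spark required; the data mirrors the example table):

```python
from itertools import groupby
from operator import itemgetter

rows = [("G1", 10), ("G1", 20), ("G2", 50), ("G2", 10), ("G2", 70)]

# Sort by the partition key and then the ordering column, mirroring
# Window.partitionBy("ColA").orderBy("ColB").
result = []
for key, group in groupby(sorted(rows), key=itemgetter(0)):
    for i, (col_a, col_b) in enumerate(group, start=1):
        result.append((col_a, col_b, i))  # i plays the role of row_number

print(result)
```

Note that because this orders by ColB within each group, the G2 rows are numbered by ascending ColB, not in their original order.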
Alternatively, if you want to keep the original row order:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val w = Window.partitionBy("ColA").orderBy("temp")
df.withColumn("temp", monotonically_increasing_id)
  .withColumn("id", row_number.over(w))
  .drop("temp")
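The temp column acts as a tiebreaker that preserves the input order before numbering. A hedged pure-Python sketch of that idea, using a per-group counter instead of Spark:

```python
rows = [("G1", 10), ("G1", 20), ("G2", 50), ("G2", 10), ("G2", 70)]

# Iterating in the original row order stands in for ordering by the
# monotonically increasing "temp" id; the counter resets per ColA value,
# like row_number over a partition on ColA.
counters = {}
result = []
for col_a, col_b in rows:
    counters[col_a] = counters.get(col_a, 0) + 1
    result.append((col_a, col_b, counters[col_a]))

print(result)
```

This reproduces exactly the desired output table, including the reset to 1 when ColA changes from G1 to G2.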