Adding a resetting index to a Spark DataFrame

Date: 2020-03-18 12:02:10

Tags: scala dataframe apache-spark

I have a DataFrame and I want to add an index column that resets based on the value of one of the columns:

--------------------
|  ColA   |  ColB  |
====================
|  G1     |  10    |
--------------------
|  G1     |  20    |
--------------------
|  G2     |  50    |
--------------------
|  G2     |  10    |
--------------------
|  G2     |  70    |
--------------------

I would like the result to be:

-----------------------------
|  ColA   |  ColB  |  ColC  |
=============================
|  G1     |  10    |   1    |
-----------------------------
|  G1     |  20    |   2    |
-----------------------------
|  G2     |  50    |   1    |   <== reset because ColA changed
-----------------------------
|  G2     |  10    |   2    |
-----------------------------
|  G2     |  70    |   3    |
-----------------------------

Is there something like df.withColumn("id", monotonically_increasing_id) that would be suitable here?

1 answer:

Answer 0 (score: 2)

Use a Window partitioned by the ColA column.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("ColA").orderBy("ColB")
df.withColumn("id", row_number().over(w))

Or, if you want to preserve the original row order:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val w = Window.partitionBy("ColA").orderBy("temp")
df.withColumn("temp", monotonically_increasing_id())
  .withColumn("id", row_number().over(w))
  .drop("temp")
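For completeness, here is a minimal, self-contained sketch of the order-preserving variant, assuming a local SparkSession and the sample data from the question (the session setup and the `ColC` output column name are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

object ResetIndexExample {
  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("reset-index")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, in its original order.
    val df = Seq(
      ("G1", 10), ("G1", 20),
      ("G2", 50), ("G2", 10), ("G2", 70)
    ).toDF("ColA", "ColB")

    // "temp" captures the original order; row_number restarts at 1
    // for each ColA partition, giving the resetting index.
    val w = Window.partitionBy("ColA").orderBy("temp")
    val result = df
      .withColumn("temp", monotonically_increasing_id())
      .withColumn("ColC", row_number().over(w))
      .drop("temp")

    result.show()
    spark.stop()
  }
}
```

Within each `ColA` group the rows keep their original relative order, so `ColC` counts 1, 2 for G1 and 1, 2, 3 for G2, matching the desired output above.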