I have a DataFrame named 'df' that looks like this:
+-------+-------+-------+
| Atr1 | Atr2 | Atr3 |
+-------+-------+-------+
| A | A | A |
+-------+-------+-------+
| B | A | A |
+-------+-------+-------+
| C | A | A |
+-------+-------+-------+
I want to add a new column with incrementing values, so that I end up with the following DataFrame:
+-------+-------+-------+-------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+-------+-------+-------+-------+
| A | A | A | 1 |
+-------+-------+-------+-------+
| B | A | A | 2 |
+-------+-------+-------+-------+
| C | A | A | 3 |
+-------+-------+-------+-------+
How can I do this?
Answer 0 (score: 8)
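If you only need incrementing values (like an ID) and there is no requirement that the numbers be consecutive, you can use monotonically_increasing_id(). The only guarantee this function gives is that the values are unique and increase from row to row; the actual values are not consecutive and can differ from one execution to the next.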
from pyspark.sql.functions import monotonically_increasing_id

# withColumn returns a new DataFrame, so keep the result
df = df.withColumn("Atr4", monotonically_increasing_id())
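To see why the generated values are increasing but not consecutive, here is a rough plain-Python model of how the IDs are composed. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The helper names `sketch_monotonic_id` and `assign_ids`, and the way the three rows are split into partitions, are assumptions made for illustration; this is not Spark's actual code.

```python
# Illustrative model only: mimics the documented bit layout of
# monotonically_increasing_id(), it does not call Spark.
ROW_BITS = 33  # lower 33 bits hold the record number within a partition


def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Combine a partition ID (upper bits) with a per-partition row number."""
    return (partition_id << ROW_BITS) + row_in_partition


def assign_ids(partitions):
    """Pair every row with its sketched ID, partition by partition."""
    return [
        [(row, sketch_monotonic_id(p, i)) for i, row in enumerate(rows)]
        for p, rows in enumerate(partitions)
    ]


# Hypothetical split of the three example rows across two partitions.
partitions = [["A", "B"], ["C"]]
ids = [row_id for part in assign_ids(partitions) for _, row_id in part]
print(ids)  # [0, 1, 8589934592]
```

With a single partition the sketch yields 0, 1, 2, which happens to look consecutive. As soon as the rows are spread over more partitions the values jump, and since the partitioning can change between runs, so can the values.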