I have a PySpark dataframe that looks like this:
col1 | col2 | col3
R    | a    | abc
R    | a    | abc
G    | b    | def
G    | b    | def
G    | b    | def
I want to add a new column that assigns a zero-based index to each row within its group of identical rows, like this:
col1 | col2 | col3 | new_column
R    | a    | abc  | 0
R    | a    | abc  | 1
G    | b    | def  | 0
G    | b    | def  | 1
G    | b    | def  | 2
Please help me generate this new column using PySpark.
Thanks!
Answer 0 (score: 3)
Partition the data by the three columns, then use row_number to assign a value to the new column. Note that row_number starts at 1, so subtract 1 to get the zero-based index shown in the expected output.
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

# All rows within a partition are identical here, so the sort order inside it is
# arbitrary; ordering by the partition columns just satisfies row_number's
# requirement for an ORDER BY clause.
windowSpec = W.partitionBy("col1", "col2", "col3").orderBy("col1", "col2", "col3")

# row_number() is 1-based, so subtract 1 to get a 0-based index.
df.withColumn("new_column", F.row_number().over(windowSpec) - 1).show()
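For intuition, the window logic above is equivalent to this plain-Python sketch (an illustration of the per-group counting only, not the PySpark API):

```python
from collections import defaultdict

rows = [
    ("R", "a", "abc"),
    ("R", "a", "abc"),
    ("G", "b", "def"),
    ("G", "b", "def"),
    ("G", "b", "def"),
]

# Mimic row_number() - 1 over a partition: for each row, record how many
# identical rows have been seen before it, then increment the counter.
counters = defaultdict(int)
indexed = []
for row in rows:
    indexed.append(row + (counters[row],))
    counters[row] += 1

for r in indexed:
    print(r)
# ('R', 'a', 'abc', 0)
# ('R', 'a', 'abc', 1)
# ('G', 'b', 'def', 0)
# ('G', 'b', 'def', 1)
# ('G', 'b', 'def', 2)
```

Unlike this sequential loop, the Spark version computes the index in parallel per partition, which is why it needs an explicit window specification.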