I'm new to Spark, and I'm trying to create a row number for a large dataset. I tried the row_number window function; it works, but it is not efficient because I'm not using a partitionBy clause (without one, Spark moves all rows into a single partition).
For example:
val df = Seq(
  ("041", false),
  ("042", false),
  ("043", false)
).toDF("id", "flag")
The result should be:
val df = Seq(
  ("041", false, 1),
  ("042", false, 2),
  ("043", false, 3)
).toDF("id", "flag", "rownum")
Currently I'm using:
df.withColumn("rownum", row_number().over(Window.orderBy($"id")))
Is there any other way to achieve this result without using a window function? I have also tried monotonicallyIncreasingId and zipWithIndex.
Answer 0 (score: 1)
You can use monotonicallyIncreasingId (named monotonically_increasing_id() in Spark 2.0+) to get a row number:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId)

Here the index starts at 0. To start the index at 1, add one to monotonicallyIncreasingId:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId + 1)

Note that the generated IDs are only guaranteed to be unique and monotonically increasing, not consecutive; they are consecutive only when the data sits in a single partition, so on a multi-partition DataFrame you may see gaps.
scala> val df2 = df.withColumn("rownum",monotonicallyIncreasingId)
df2: org.apache.spark.sql.DataFrame = [id: string, flag: boolean, rownum: bigint]
scala> df2.show
+---+-----+------+
| id| flag|rownum|
+---+-----+------+
|041|false| 0|
|042|false| 1|
|043|false| 2|
+---+-----+------+
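The question also mentions zipWithIndex. Unlike monotonically_increasing_id, RDD.zipWithIndex does produce consecutive 0-based indices across partitions (at the cost of an extra Spark job to count partition sizes). As a minimal sketch of its semantics using plain Scala collections (Spark's RDD.zipWithIndex returns (element, index) pairs the same way; on a real DataFrame you would call df.rdd.zipWithIndex and rebuild a DataFrame from the result):

```scala
// Sample rows matching the question's data.
val rows = Seq(("041", false), ("042", false), ("043", false))

// zipWithIndex pairs each element with a consecutive 0-based index;
// adding 1 yields the 1-based row number the question asks for.
val numbered = rows.zipWithIndex.map { case ((id, flag), i) => (id, flag, i + 1) }

numbered.foreach(println)
```

This gives gap-free row numbers without a window function, though the ordering is the existing order of the data rather than an explicit orderBy.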