How to implement something like Oracle's ROWNUM in a Spark DataFrame

Date: 2017-02-16 13:01:25

Tags: apache-spark

I'm new to Spark, and I'm trying to create a row number for a large dataset. I tried the row_number window function; it works, but it isn't efficient, because without a partitionBy clause Spark moves all the data to a single partition to apply the window.

For example:

val df = Seq(
  ("041", false),
  ("042", false),
  ("043", false)
).toDF("id", "flag")

The expected result is:

val df = Seq(
  ("041", false, 1),
  ("042", false, 2),
  ("043", false, 3)
).toDF("id", "flag", "rownum")

Currently I'm using:

df.withColumn("rownum", row_number().over(Window.orderBy($"id")))

Is there any other way to achieve this result without using a window function? I have also tried monotonicallyIncreasingId and zipWithIndex.

1 Answer:

Answer 0 (score: 1)

You can use monotonicallyIncreasingId (exposed as monotonically_increasing_id in Spark 2.0+) to get rownum-like functionality:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId)

Here the index starts at 0.

To start the index at 1, add 1 to monotonicallyIncreasingId:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId + 1)

scala> val df2 = df.withColumn("rownum",monotonicallyIncreasingId)
df2: org.apache.spark.sql.DataFrame = [id: string, flag: boolean, rownum: bigint]

scala> df2.show
+---+-----+------+
| id| flag|rownum|
+---+-----+------+
|041|false|     0|
|042|false|     1|
|043|false|     2|
+---+-----+------+
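The consecutive 0, 1, 2 above relies on this toy DataFrame living in a single partition: monotonicallyIncreasingId guarantees only unique, monotonically increasing values, and it encodes the partition index in the upper bits of the ID, so on multi-partition data the numbers jump between partitions. A quick illustration (the exact values depend on how rows land in the partitions):

// Each partition numbers its rows starting from (partitionId << 33),
// so the IDs stay unique and increasing but are no longer consecutive,
// e.g. 0, 8589934592, 8589934593 instead of 0, 1, 2.
df.repartition(2).withColumn("rownum", monotonicallyIncreasingId).show()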


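If strictly consecutive numbers are needed even on partitioned data, zipWithIndex on the underlying RDD (which the question also mentions) avoids both the single-partition shuffle of an unpartitioned window and the ID gaps. A minimal sketch, assuming the question's df and an active SparkSession named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex assigns a consecutive, 0-based Long index to every row
// across all partitions; append it (plus 1, to mimic Oracle's ROWNUM).
val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))
}

// Extend the original schema with the new column and rebuild the DataFrame.
val schema = StructType(df.schema.fields :+
  StructField("rownum", LongType, nullable = false))
val withRownum = spark.createDataFrame(indexed, schema)

zipWithIndex runs one extra job to compute per-partition offsets, but unlike the unpartitioned window it does not move all rows to a single partition.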