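I have the DataFrame below and created a new column using row_number with a Window function. However, every time I call show(), "windowIndex" has different ID values. I see the same problem with monotonically_increasing_id(). Can anyone help me create a consistent index so that I can join two DataFrames?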
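from pyspark.sql import Window
from pyspark.sql.functions import lit, monotonically_increasing_id, row_number

# sqlContext is assumed to already exist (e.g. in the pyspark shell)
df = sqlContext.createDataFrame([
    (1000, "a"),
    (1231, "b"),
    (2221, "c"),
    (2334, "d"),
    (4124, "e"),
    (5002, "c")
], ["id", "value"])  # each row has two fields; "value" is a placeholder name for the second one
# Ordering by a constant gives Spark no real sort key
w = Window.orderBy(lit('A'))
df = df.withColumn('windowIndex', row_number().over(w))
df.select(['id','windowIndex']).show(10)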
+----+-----------+
|  id|windowIndex|
+----+-----------+
|1000|          1|
|1231|          2|
|2221|          3|
|2334|          4|
|4124|          5|
|5002|          6|
+----+-----------+
df.select(['id','windowIndex']).show(10)
+----+-----------+
|  id|windowIndex|
+----+-----------+
|2334|          1|
|4124|          2|
|5002|          3|
|1000|          4|
|1231|          5|
|2221|          6|
+----+-----------+
df = df.withColumn('monoIndex', monotonically_increasing_id())
df.select(['id','monoIndex']).show(10)
+----+----------+
|  id| monoIndex|
+----+----------+
|1000|         0|
|1231|         1|
|2221|         2|
|2334|8589934592|
|4124|8589934593|
|5002|8589934594|
+----+----------+
df.select(['id','monoIndex']).show(10)
+----+----------+
|  id| monoIndex|
+----+----------+
|1000|         0|
|1231|         1|
|2221|         2|
|2334|8589934592|
|4124|8589934593|
|5002|8589934594|
+----+----------+
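As far as I can tell from the Spark documentation, the large monoIndex values are expected: the current implementation of monotonically_increasing_id() is described as putting the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below decodes the values shown above under that assumption; decode_mono_id is a helper name I made up for illustration and is not part of Spark.

```python
# Assumption (from the Spark docs): upper 31 bits = partition index,
# lower 33 bits = record number within that partition.
RECORD_BITS = 33


def decode_mono_id(mono_id):
    """Split a monotonically_increasing_id() value into (partition_index, record_number)."""
    partition_index = mono_id >> RECORD_BITS
    record_number = mono_id & ((1 << RECORD_BITS) - 1)
    return partition_index, record_number


# The monoIndex values from the output above
mono_ids = [0, 1, 2, 8589934592, 8589934593, 8589934594]
decoded = [decode_mono_id(m) for m in mono_ids]

for mono_id, (partition_index, record_number) in zip(mono_ids, decoded):
    print(mono_id, "-> partition", partition_index, "record", record_number)
```

If that reading is right, the six rows sit in two partitions of three rows each, so these IDs are unique and increasing but not consecutive.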
I would like the index to stay the same no matter when show() is called, like this:
+----+---------------+
|  id| expected_Index|
+----+---------------+
|1000|              0|
|1231|              1|
|2221|              2|
|2334|              3|
|4124|              4|
|5002|              5|
+----+---------------+