PySpark - monotonically_increasing_id() and row_number() do not give a consistent index

Time: 2019-05-15 21:38:11

Tags: pyspark

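I have a DataFrame like the one below, and I created a new column using row_number and a Window function. However, every time I run show(), the "windowIndex" column has different ID values. The same happens with monotonically_increasing_id(). Can anyone help me create a consistent index so that I can join two DataFrames?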

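from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number, monotonically_increasing_id

df = sqlContext.createDataFrame([
    (1000, "a"),
    (1231, "b"),
    (2221, "c"),
    (2334, "d"),
    (4124, "e"),
    (5002, "c")
], ["id"])

# Order by a constant literal: every row ties, so the row order is not deterministic
w = Window.orderBy(lit('A'))
df = df.withColumn('windowIndex', row_number().over(w))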

df.select(['id','windowIndex']).show(10)
+----+-----------+
|  id|windowIndex|
+----+-----------+
|1000|          1|
|1231|          2|
|2221|          3|
|2334|          4|
|4124|          5|
|5002|          6|
+----+-----------+

df.select(['id','windowIndex']).show(10)
+----+-----------+
|  id|windowIndex|
+----+-----------+
|2334|          1|
|4124|          2|
|5002|          3|
|1000|          4|
|1231|          5|
|2221|          6|
+----+-----------+

df = df.withColumn('monoIndex', monotonically_increasing_id())
df.select(['id','monoIndex']).show(10)
+----+----------+
|  id| monoIndex|
+----+----------+
|1000|         0|
|1231|         1|
|2221|         2|
|2334|8589934592|
|4124|8589934593|
|5002|8589934594|
+----+----------+

df.select(['id','monoIndex']).show(10)

+----+----------+
|  id| monoIndex|
+----+----------+
|1000|         0|
|1231|         1|
|2221|         2|
|2334|8589934592|
|4124|8589934593|
|5002|8589934594|
+----+----------+
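
The jump from 2 to 8589934592 in monoIndex follows from how the ID is built. Spark's documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below decodes the values shown above; split_mono_id and RECORD_BITS are hypothetical names used only for illustration.

```python
# Number of low bits that hold the per-partition record number
# (per the documented layout: 31 bits partition ID, 33 bits record number).
RECORD_BITS = 33


def split_mono_id(mono_id):
    """Split a monotonically_increasing_id() value into (partition_id, record_number)."""
    partition_id = mono_id >> RECORD_BITS
    record_number = mono_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The monoIndex values from the output above
mono_ids = [0, 1, 2, 8589934592, 8589934593, 8589934594]
decoded = [split_mono_id(m) for m in mono_ids]

for mono_id, (partition_id, record_number) in zip(mono_ids, decoded):
    print(mono_id, "-> partition", partition_id, "record", record_number)
```

So the six rows sit in two partitions of three rows each. The IDs are increasing and unique, but they are consecutive only within a partition, and they depend on how the data happens to be partitioned.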

I want the index to stay consistent no matter when show() is called.

+----+---------------+
|  id| expected_Index|
+----+---------------+
|1000|              0|
|1231|              1|
|2221|              2|
|2334|              3|
|4124|              4|
|5002|              5|
+----+---------------+

0 answers:

No answers yet