Basically I want the same thing as in this SO question. The accepted answer there suggests the problem was fixed as of Spark 2.0 / Spark 2.1, and I am on Spark 2.1.1. Still, I run into the same (or at least a very similar) issue: I create an ID column with pyspark.sql.functions.monotonically_increasing_id(), expecting that column to identify each row consistently for the lifetime of the SparkSession. It does not:
>>> import random
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.getOrCreate()
>>> data = []
>>> for i in range(10000):
...     data.append((i, i//20, random.random()))
...
>>> df = spark.createDataFrame(data, ['ID', 'GID', 'DAT1'])
>>> df.printSchema()
root
 |-- ID: long (nullable = true)
 |-- GID: long (nullable = true)
 |-- DAT1: double (nullable = true)
>>> df = (df
... .repartition('GID')
... .orderBy('GID', 'ID')
... .withColumn('_ID', F.monotonically_increasing_id())
... )
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID| DAT1| _ID|
+---+---+------------------+----------+
|100| 5|0.5893680390376791|8589934635|
+---+---+------------------+----------+
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID| DAT1| _ID|
+---+---+------------------+----------+
|100| 5|0.5893680390376791|8589934640|
+---+---+------------------+----------+
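The drift is not limited to a single row; comparing the whole _ID column across two separate actions makes the same point (a quick sanity check, assuming the collected column fits in driver memory):
>>> ids_a = [r._ID for r in df.select('_ID').collect()]
>>> ids_b = [r._ID for r in df.select('_ID').collect()]
>>> ids_a == ids_b  # False whenever the ids were regenerated between the two actions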
Only after persisting df do the results appear to be "stable":
>>> df = df.persist()
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID| DAT1| _ID|
+---+---+------------------+----------+
|100| 5|0.5893680390376791|8589934638|
+---+---+------------------+----------+
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID| DAT1| _ID|
+---+---+------------------+----------+
|100| 5|0.5893680390376791|8589934638|
+---+---+------------------+----------+
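My understanding (which may be wrong, hence this question) is that persist() only pins the computed result for as long as the cached blocks are available; once the cache is dropped, the plan is re-evaluated and the ids could be regenerated, e.g.:
>>> df.unpersist()
>>> df.where('ID = 100').show()  # _ID may drift again once the cached data is gone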
My question is: is persist() a good way to solve this problem, or should I expect further issues with it?
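For reference, the only fully deterministic alternative I am aware of is assigning the id with row_number() over an explicit global ordering. That sidesteps the regeneration issue entirely, but (as far as I understand) it forces all the data through a single partition for the window, so this is just a sketch, not something I would want to run on large data:
>>> from pyspark.sql import Window
>>> w = Window.orderBy('GID', 'ID')  # global window: deterministic, but single-partition
>>> df2 = df.withColumn('_ID2', F.row_number().over(w))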