How to add a persistent row ID column to a Spark DataFrame

Date: 2017-06-14 12:13:44

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

Basically, I want the same thing as in this SO question. The accepted answer there suggests the issue was fixed in Spark 2.0 / Spark 2.1. I am using Spark 2.1.1.

However, I still run into the same (or a similar) problem: I created an ID column with pyspark.sql.functions.monotonically_increasing_id(), expecting it to identify each row consistently for the lifetime of the SparkSession. It does not:

>>> import random
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import functions as F

>>> spark = SparkSession.builder.getOrCreate()

>>> data = []

>>> for i in range(10000):
...     data.append((i, i//20, random.random()))

>>> df = spark.createDataFrame(data, ['ID', 'GID', 'DAT1'])
>>> df.printSchema()
root
 |-- ID: long (nullable = true)
 |-- GID: long (nullable = true)
 |-- DAT1: double (nullable = true)

>>> df = (df
...       .repartition('GID')
...       .orderBy('GID', 'ID')
...       .withColumn('_ID', F.monotonically_increasing_id())
...       )

>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID|              DAT1|       _ID|
+---+---+------------------+----------+
|100|  5|0.5893680390376791|8589934635|
+---+---+------------------+----------+
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID|              DAT1|       _ID|
+---+---+------------------+----------+
|100|  5|0.5893680390376791|8589934640|
+---+---+------------------+----------+

Only after persisting the df does the result appear to be "stable":

>>> df = df.persist()
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID|              DAT1|       _ID|
+---+---+------------------+----------+
|100|  5|0.5893680390376791|8589934638|
+---+---+------------------+----------+
>>> df.where('ID = 100').show()
+---+---+------------------+----------+
| ID|GID|              DAT1|       _ID|
+---+---+------------------+----------+
|100|  5|0.5893680390376791|8589934638|
+---+---+------------------+----------+
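
A quick way to check this programmatically, rather than eyeballing the show() output, is to compare the generated _ID across two separate actions on the same logical row (this check is my addition; it only uses the df from above and standard DataFrame/Row methods):

>>> first = df.where('ID = 100').collect()[0]['_ID']   # first action
>>> second = df.where('ID = 100').collect()[0]['_ID']  # second action
>>> assert first == second  # holds after persist(); not guaranteed before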

My questions are:

  • Am I missing something?
  • Is this how it is supposed to work, i.e. is persist() a proper way to solve the problem, or should I expect further issues? (A sketch of one alternative follows below.)
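
For reference, a minimal sketch of an alternative, assuming one accepts a round trip through the RDD API (the with_row_id helper below is my own naming, not a built-in): RDD.zipWithIndex() assigns consecutive indices when the RDD is evaluated, so they do not shift between actions the way the unpersisted monotonically_increasing_id() values above do. It still assumes the row order feeding into it is deterministic, hence the orderBy in the usage line:

>>> from pyspark.sql.types import StructType, StructField, LongType
>>> def with_row_id(df, col_name='_ID'):
...     """Append a consecutive long row ID, fixed at evaluation time."""
...     schema = StructType(df.schema.fields +
...                         [StructField(col_name, LongType(), False)])
...     return (df.rdd
...             .zipWithIndex()                          # (Row, index) pairs
...             .map(lambda pair: pair[0] + (pair[1],))  # append index to the row tuple
...             .toDF(schema))
...
>>> df2 = with_row_id(df.drop('_ID').orderBy('GID', 'ID'))

The trade-off is an extra pass through the RDD layer, whereas persist() keeps everything in the DataFrame API.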

0 Answers:

No answers yet.