Question

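Running the code below repeatedly produces inconsistent results. So far I have only seen two distinct outputs. One result repeats an arbitrary number of times before switching to the other, which then also repeats a random number of times before switching back.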

Why does this happen?

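In this example I could build the index with the window function, ordered with orderBy(), before I modify the single column. In my real case I don't have that option, so it is not a solution for me.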

import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
import pyspark.sql.functions as F 
from pyspark.sql.window import Window as W

window = W.rowsBetween(W.unboundedPreceding, W.currentRow)
testCol = [tuple([x]) for x in range(1,5000)]

# repeatedly re-run from here:
testDF = (spark.createDataFrame(testCol,['testCol'])
              .withColumn('testCol',
                          F.when(F.col('testCol') % 2 == 0, 
                             F.col('testCol'))
                          .otherwise(0.0))              
              .withColumn('int', F.lit(1))
              .withColumn('index', F.sum('int').over(window))
              .drop('int') 
)

testDF.show()

Result 1 (expected):

+-------+-----+
|testCol|index|
+-------+-----+
|    0.0|    1|
|    2.0|    2|
|    0.0|    3|
|    4.0|    4|
|    0.0|    5|
|    6.0|    6|
|    0.0|    7|
|    8.0|    8|
|    0.0|    9|
|   10.0|   10|
|    0.0|   11|
|   12.0|   12|
|    0.0|   13|
|   14.0|   14|
|    0.0|   15|
|   16.0|   16|
|    0.0|   17|
|   18.0|   18|
|    0.0|   19|
|   20.0|   20|
+-------+-----+
only showing top 20 rows

Result 2 (unexpected):

+-------+-----+
|testCol|index|
+-------+-----+
|    0.0|    1|
| 2050.0|    2|
|    0.0|    3|
| 2052.0|    4|
|    0.0|    5|
| 2054.0|    6|
|    0.0|    7|
| 2056.0|    8|
|    0.0|    9|
| 2058.0|   10|
|    0.0|   11|
| 2060.0|   12|
|    0.0|   13|
| 2062.0|   14|
|    0.0|   15|
| 2064.0|   16|
|    0.0|   17|
| 2066.0|   18|
|    0.0|   19|
| 2068.0|   20|
+-------+-----+
only showing top 20 rows
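
One possible reading of the two outputs, offered as an assumption rather than a confirmed diagnosis: Spark gives no row-order guarantee for a window that has no orderBy(), so the running sum simply follows whatever order the rows arrive in. The plain-Python toy below (no Spark involved) mimics that. The helper `running_index` and the split point 2048 are hypothetical, chosen only so the toy reproduces the numbers shown in Result 2; how Spark actually partitioned the data cannot be told from the question.

```python
def running_index(partitions, arrival_order):
    """Concatenate partitions in the given arrival order, then attach a
    running count, mimicking sum(lit(1)) over a window with no ordering."""
    rows = [value for p in arrival_order for value in partitions[p]]
    return [(float(value) if value % 2 == 0 else 0.0, index)
            for index, value in enumerate(rows, start=1)]


values = list(range(1, 5000))

# Hypothetical two-way split; the real partition boundaries are unknown.
partitions = [values[:2048], values[2048:]]

in_order = running_index(partitions, [0, 1])  # partition 0 arrives first
swapped = running_index(partitions, [1, 0])   # partition 1 arrives first

print(in_order[:4])  # [(0.0, 1), (2.0, 2), (0.0, 3), (4.0, 4)]
print(swapped[:4])   # [(0.0, 1), (2050.0, 2), (0.0, 3), (2052.0, 4)]
```

Both runs number the same 4999 rows from 1 upward; only the pairing of value and index changes, which is the pattern seen in Result 1 versus Result 2.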

The following code also produces exactly the same inconsistent output:

testDF = (spark.createDataFrame(testCol,['testCol'])
              .repartition(1) # to address how monotonically_increasing_id works
              .withColumn('id', F.monotonically_increasing_id())            
              .withColumn('testCol',
                          F.when(F.col('testCol') % 2 == 0, 
                             F.col('testCol'))
                          .otherwise(0.0))              
)

testDF.show()

Inconsistent results in pyspark

0 answers: