Question

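In PySpark, when I use count().over(window), the result is not what I expect if the window definition includes an orderBy. I don't know whether this is a bug or whether there is a better way to do this.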

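I compared the same groups using two window definitions, one with orderBy and one without. They give different results. The window definition without orderBy gives the result I expect.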

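# Explicit imports avoid shadowing Python builtins such as sum and max
from pyspark.sql.functions import col, count
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.window import Window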
cschema = StructType([StructField('customer',StringType()),StructField('sales', IntegerType())])
data = [
    ['Bob',20],
    ['Bob',30],
    ['Bob',22],
    ['John',33],
    ['John', 18],
    ['Bob', 30],
    ['John', 18]]
test_df = spark.createDataFrame(data, schema = cschema)
test_df.show()

+--------+-----+
|customer|sales|
+--------+-----+
|     Bob|   20|
|     Bob|   30|
|     Bob|   22|
|    John|   33|
|    John|   18|
|     Bob|   30|
|    John|   18|
+--------+-----+

win_ordered = Window.partitionBy('customer').orderBy(col('sales'))
win_non_ordered = Window.partitionBy('customer')
test_df.withColumn('cnt1', count(col('sales')).over(win_ordered)) \
    .withColumn('cnt2', count(col('sales')).over(win_non_ordered)) \
    .show()

+--------+-----+----+----+
|customer|sales|cnt1|cnt2|
+--------+-----+----+----+
|     Bob|   20|   1|   4|
|     Bob|   22|   2|   4|
|     Bob|   30|   4|   4|
|     Bob|   30|   4|   4|
|    John|   18|   2|   3|
|    John|   18|   2|   3|
|    John|   33|   3|   3|
+--------+-----+----+----+

I expected the "cnt1" column to have the same value across the whole group, just like the "cnt2" column.

When the window definition contains an orderBy, the window function count() does not work the way I expect.
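
A possible explanation, offered as a sketch rather than a confirmed diagnosis: as far as I understand the Spark documentation, a window with an orderBy defaults to a growing frame (range between unbounded preceding and the current row), while a window without ordering covers the whole partition. The pure-Python emulation below does not use Spark at all; the helper `count_over` and its `ordered` flag are hypothetical names made up for illustration. It only checks whether that frame rule reproduces the cnt1 and cnt2 columns shown above.

```python
# Pure-Python emulation of the two frames; this is NOT Spark code.
# count_over and its `ordered` flag are hypothetical, for illustration only.
from collections import defaultdict

rows = [('Bob', 20), ('Bob', 30), ('Bob', 22), ('John', 33),
        ('John', 18), ('Bob', 30), ('John', 18)]


def count_over(rows, ordered):
    """Return (customer, sales, cnt) tuples sorted by customer, then sales."""
    partitions = defaultdict(list)
    for customer, sales in rows:
        partitions[customer].append(sales)

    out = []
    for customer in sorted(partitions):
        values = sorted(partitions[customer])
        for v in values:
            if ordered:
                # Assumed default frame with ordering: all rows whose sales
                # value is <= the current one, so ties are counted together.
                cnt = sum(1 for other in values if other <= v)
            else:
                # Assumed default frame without ordering: the whole partition.
                cnt = len(values)
            out.append((customer, v, cnt))
    return out


cnt1 = count_over(rows, ordered=True)
cnt2 = count_over(rows, ordered=False)
for (customer, sales, c1), (_, _, c2) in zip(cnt1, cnt2):
    print(customer, sales, c1, c2)
```

If that reading is right, the ordered window is returning a running count rather than a wrong one. In PySpark the frame can be stated explicitly with `Window.partitionBy('customer').orderBy(col('sales')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`, which should make cnt1 match cnt2; I have not verified this against every Spark version.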

0 Answers: