I am using PySpark v1.6.2, and my code looks like this:
from pyspark.sql import Window
from pyspark.sql.functions import lit, min

df = sqlContext.sql('SELECT * FROM <lib>.<table>')
df.rdd.getNumPartitions() ## 2496
df = df.withColumn('count', lit(1)) ## up to this point it still has 2496 partitions
df = df.repartition(2496,'trip_id').sortWithinPartitions('trip_id','time')
# This is where the trouble starts
sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id','time') ## Defining a window
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
# Done with window function
df.rdd.getNumPartitions() ## 200
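I suspect the 200 comes from spark.sql.shuffle.partitions (its default is 200), since the window function triggers a shuffle. Below is a minimal sketch of what I could try, assuming that setting really is what the window's shuffle uses; the 2496 simply mirrors my earlier repartition call:

## spark.sql.shuffle.partitions defaults to 200, which matches what I see after the window
sqlContext.setConf('spark.sql.shuffle.partitions', '2496')  ## set before the window step
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
df.rdd.getNumPartitions()  ## hopefully 2496 now, if my assumption holds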
My question is:

Is there a way to tell PySpark how many partitions it should use when applying Window.partitionBy(*cols)?

Alternatively, is there a way to influence PySpark so that it keeps the same number of partitions the DataFrame had before the window function was applied?
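One workaround I can think of is an extra repartition right after the window, but that costs another full shuffle, which is exactly what I would like to avoid:

df = df.repartition(2496, 'trip_id')  ## restores the partition count, at the price of an extra shuffle
df.rdd.getNumPartitions()  ## 2496 again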