PySpark - Window.partitionBy() - number of partitions

Date: 2017-02-28 16:51:42

Tags: pyspark, partitioning

I am using PySpark v1.6.2, and my code looks like this:

from pyspark.sql import Window
from pyspark.sql.functions import lit, min  ## this min is the SQL aggregate, not the Python builtin

df = sqlContext.sql('SELECT * FROM <lib>.<table>')
df.rdd.getNumPartitions() ## 2496
df = df.withColumn('count', lit(1)) ## up to this point it still has 2496 partitions
df = df.repartition(2496, 'trip_id').sortWithinPartitions('trip_id', 'time')
# This is where the trouble starts
sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id', 'time') ## Defining a window
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
# Done with window function
df.rdd.getNumPartitions() ## 200
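
For reference, 200 is Spark's default value of spark.sql.shuffle.partitions, the setting that determines how many partitions come out of the shuffle a window function introduces. A quick check in the same session (a hypothetical snippet, assuming the standard sqlContext from a PySpark 1.6 shell):

sqlContext.getConf('spark.sql.shuffle.partitions', '200') ## '200' unless overridden; the second argument is only a fallback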

My questions are:

Is there a way to tell PySpark how many partitions it should create when the function Window.partitionBy(*cols) is applied?

Or, is there a way to influence PySpark to keep the same number of partitions the DataFrame had before the window function was applied?

0 Answers:

No answers yet.
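
No answer was posted, but a minimal sketch of two plausible workarounds follows, continuing the session above. It assumes the drop to 200 partitions comes from the shuffle behind the window function and uses only calls available in PySpark 1.6; the value 2496 simply mirrors the original partition count from the question.

# Option 1: raise the shuffle partition count before running the window function
sqlContext.setConf('spark.sql.shuffle.partitions', '2496')
sequenceWS = Window.partitionBy('trip_id').orderBy('trip_id', 'time')
df = df.withColumn('delta_time', (df['time'] - min(df['time']).over(sequenceWS.rowsBetween(-1, 0))))
df.rdd.getNumPartitions() ## should now report 2496 rather than 200

# Option 2: explicitly repartition back to the original layout after the window step
df = df.repartition(2496, 'trip_id')

Option 1 avoids an extra shuffle but changes the setting for every later shuffle in the session; Option 2 keeps the default but pays for one additional repartition.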