Question

I have a large table.

I want to transform it into a new table with the columns id, date, last_state.

This is easy with pandas:

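import pandas as pd

df['time_create'] = pd.to_datetime(df['time_create'])
df = df.set_index('time_create')
df = df.sort_index()
# For each id, bucket the rows by calendar day and keep the last row of each day
df = df.groupby('id').resample('D').last().reset_index()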

But it is hard to do in PySpark.

What I know so far:

The PySpark equivalent of resampling is groupBy + window:
```
from pyspark.sql import functions as F
grouped = df.groupBy('store_product_id', F.window("time_create", "1 day")).agg(F.sum("Production").alias('Sum Production'))
```
This groups by store_product_id, resamples into one-day buckets, and computes the sum.

Group by and find the first or last row:

See https://stackoverflow.com/a/35226857/1637673

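from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("store_product_id").orderBy(col("time_create").desc())
(df
  .withColumn("rn", row_number().over(w))  # rn == 1 marks the newest row per id
  .where(col("rn") == 1)
  .select("store_product_id", "time_create", "state"))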

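This groups by id and takes the last row by time_create.

But what I need is to group by id, resample by day, and then take the last row by time_create.

I know that a pandas UDF might solve this (see "Applying UDFs on GroupedData in PySpark (with functioning python example)").

But is there a way to do this with PySpark alone?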
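
To make the target concrete, the desired result can be sketched in plain Python on a few rows. The sample rows and the helper name `last_state_per_day` are made up for illustration; only the column meanings (id, time_create, state) come from the table described above.

```python
from datetime import datetime

# Hypothetical sample rows: (id, time_create, state)
rows = [
    (1, datetime(2020, 1, 1, 9, 0), "a"),
    (1, datetime(2020, 1, 1, 18, 0), "b"),
    (1, datetime(2020, 1, 2, 9, 0), "c"),
    (2, datetime(2020, 1, 1, 10, 0), "x"),
]


def last_state_per_day(rows):
    """Return sorted (id, date, last_state) tuples, one per id and calendar day."""
    latest = {}
    for row_id, ts, state in rows:
        key = (row_id, ts.date())
        # Keep the row with the greatest time_create within each (id, day) group.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, state)
    return sorted(
        (row_id, day.isoformat(), state)
        for (row_id, day), (_, state) in latest.items()
    )


result = last_state_per_day(rows)
for line in result:
    print(line)
```

For id 1 on 2020-01-01 there are two rows, and only the later one (state "b") should survive.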

Answer 1

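Just partition by both columns, partitionBy("store_product_id", "date"), where date is the calendar day of time_create (derivable with to_date(col("time_create")) if the table does not already have it):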

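from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, to_date

w = Window.partitionBy("store_product_id", "date").orderBy(col("time_create").desc())
x = (df
    .withColumn("date", to_date(col("time_create")))  # not needed if df already has a date column
    .withColumn("rn", row_number().over(w))
    .where(col("rn") == 1)  # newest row of each (id, day) partition
    .select("store_product_id", "time_create", "state"))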

PySpark equivalent of pandas df.groupby('id').resample('D').last()
