PySpark: Efficiently applying rolling window functions over multiple columns

Time: 2020-08-13 10:15:25

Tags: pyspark

I have a dataset with multiple columns, and I want to apply several functions to each column. An example:

Columns: ['source_bytes', 'source_packets', 'rate']

Functions: ['avg', 'stddev']

The result would be a rolling window that generates new columns named:

source_bytes_avg,source_bytes_stddev,source_packets_avg,source_packets_stddev
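
To make the naming scheme concrete, here is a minimal plain-Python sketch (no Spark involved) that builds one name per column/function pair. Taken literally it also yields rate_avg and rate_stddev, which the list above leaves out; I am assuming they were omitted only for brevity. The variable name new_names is just for illustration.

```python
from itertools import product

columns = ["source_bytes", "source_packets", "rate"]
functions = ["avg", "stddev"]

# One output name per (column, function) pair, grouped by column.
new_names = [f"{c}_{fn}" for c, fn in product(columns, functions)]
print(new_names)
```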

I already have the rolling window set up, but I would like to know how to apply it efficiently to many columns:

from pyspark.sql import Window
from pyspark.sql import functions as F

# 30-minute look-back per source IP; range is in seconds (epoch timestamps).
w = (Window
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-1800, 0))

flows_filtered_v2_df2 = (
    flows_filtered_v2_df
    .withColumn("timestamp", F.unix_timestamp(F.to_timestamp("start_time")))
    .withColumn("src_bytes_avg_30min", F.avg("source_bytes").over(w))
    .withColumn("src_bytes_std_30min", F.stddev("source_bytes").over(w))
)

0 answers:

No answers yet