I am trying to create a PySpark function that performs multiple operations, such as average and standard deviation, over different windows of a dataframe.
The function takes the dataframe, the names of the columns to operate on, the operations to perform, the windows to consider, and the columns to order by.
Below is the code I have tried:
data = [
('a', '2017-01-01',800,10),
('a', '2017-01-02',800,16),
('a', '2017-01-03',300,91),
('a', '2017-01-04',150,34),
('a', '2017-01-05',300,23),
('a', '2017-01-06',500,87),
('a', '2017-01-07',800,90),
('a','2017-01-08',600,35),
('a', '2017-01-09',400,24),
('a', '2017-01-10',800,97),
('a', '2017-01-11',900,21),
('b', '2017-01-01',800,96),
('b', '2017-01-02',800,99),
('b', '2017-01-03',300,23),
('b', '2017-01-04',150,64),
('b', '2017-01-05',300,42),
('b', '2017-01-06',500,54),
('b', '2017-01-07',800,95),
('b','2017-01-08',600,70),
('b', '2017-01-09',400,64),
('b', '2017-01-10',800,53),
('b', '2017-01-11',900,87)
]
data = spark.createDataFrame(data, ['cd','week_strt_date','spend','no_trx'])
display(data)
from pyspark.sql import functions as f
from pyspark.sql.window import Window

def get_synthetics(df, ops_cols, ops, win_weeks, order_cols):
    for win in win_weeks:
        # Trailing window per 'cd': the current week plus the (win - 1) preceding weeks
        win_per = (Window.partitionBy('cd')
                         .orderBy('week_strt_date')
                         .rowsBetween(-(win - 1), 0))
        for col_name in ops_cols:
            for op in ops:
                df = df.withColumn(f'{op}_{win}w_{col_name}',
                                   getattr(f, op)(col_name).over(win_per))
    return df.orderBy(order_cols)

aa = get_synthetics(df=data, ops_cols=['spend','no_trx'], ops=['avg','stddev'],
                    win_weeks=[4,6], order_cols=['cd','week_strt_date'])
What I want is the moving average of the columns spend and no_trx over the past 4 weeks and over the past 6 weeks. The standard deviation must also be computed for the same two columns over both windows.
So the final output should be a single dataframe containing the columns:
cd, week_strt_date, 4 week moving average of spend, 6 week moving average of spend, 4 week moving average of no_trx, 6 week moving average of no_trx, 4 week std dev of spend, 6 week std dev of spend, 4 week std dev of no_trx, 6 week std dev of no_trx
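To sanity-check the numbers those Spark windows should produce, here is a small pandas equivalent for group 'a' only (assuming pandas is available; the column names `avg_spend_4w`/`std_spend_4w` are illustrative, not from the original code). `rolling(window=4)` over an ordered group mirrors `Window.rowsBetween(-3, 0)`, and pandas' `.std()` uses the same sample standard deviation as Spark's `stddev`:

```python
import pandas as pd

# Same spend values as group 'a' in the Spark frame above
pdf = pd.DataFrame({
    'week_strt_date': pd.date_range('2017-01-01', periods=11),
    'spend': [800, 800, 300, 150, 300, 500, 800, 600, 400, 800, 900],
})

# 4-week trailing window: the current row plus the 3 preceding rows,
# i.e. the pandas analogue of Window.rowsBetween(-3, 0)
pdf['avg_spend_4w'] = pdf['spend'].rolling(window=4, min_periods=1).mean()
pdf['std_spend_4w'] = pdf['spend'].rolling(window=4, min_periods=1).std()

print(pdf[['week_strt_date', 'avg_spend_4w', 'std_spend_4w']])
```

For example, the fourth row covers spend values 800, 800, 300, 150, so its 4-week average is 512.5; the 6-week columns follow the same pattern with `window=6`.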