我有以下客户销售历史数据框(它只是其中的一部分,实际数据框超过70k行):
import pandas as pd
import datetime as DT
df_test = pd.DataFrame({
'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"],
'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6],
'Date' : [
DT.datetime(2015,10,18),
DT.datetime(2015,11,14),
DT.datetime(2015,11,18),
DT.datetime(2015,12,13),
DT.datetime(2015,12,19),
DT.datetime(2016,1,24),
DT.datetime(2016,1,31),
DT.datetime(2016,2,17),
DT.datetime(2016,3,28),
DT.datetime(2016,3,31),
DT.datetime(2016,4,3),
DT.datetime(2016,4,16),
]})
我想在数据框中添加一个新列,以显示该客户过去90天的时间加权平均值结果。
预期结果(专栏Value_Result
):
Cus_ID Date Value Value_Result
0 T313 2015-10-18 3.00 NaN (No 90days history)
1 T348 2015-11-14 2.00 NaN (No 90days history)
2 T313 2015-11-18 3.00 3 (3*31)/31
3 T348 2015-12-13 4.00 2 (2*29)/29
4 T313 2015-12-19 5.00 3 (3*62+3*31)/(62+31)
5 T348 2016-01-24 3.00 2.743 (4*42+2*71)/(42+71)
6 T329 2016-01-31 7.25 NaN (No 90days history)
7 T329 2016-02-17 10.25 7.25 (7.25*17)/17
8 T348 2016-03-28 4.50 3 (3*64)/64
9 T313 2016-03-31 11.75 NaN (No 90days history)
10 T329 2016-04-03 6.25 8.516 (10.25*46+7.25*63)/(46+63)
11 T348 2016-04-16 6.00 3.279 (4.5*19+3*83)/(19+83)
我已经尝试使用groupby('Cus_ID')
和滚动应用,但我很难将函数编写为仅考虑90天后。
任何输入都受到高度赞赏。
答案 0 :(得分:1)
我不确定滚动功能是加权平均值的方法,尽管可能有其他人知道如何使用它 我不能保证这将是最优化的方法,但它会产生你想要的结果,你可以采取这个并在必要时进行构建。
非常感谢pbpython article。我建议通读它。
我的方法是创建一个将应用于组的功能(按Cus_ID分组)。此函数将迭代该组中的行并按上述方法执行加权平均,将其应用回组并返回组。此代码段的详细信息是为了清晰显示,您可以根据需要删除所有变量的创建来修剪它。
apply函数看起来像这样
def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average", days_back='-90 days', fill_value=np.nan):
"""
Will calculate the weighted (by day) time average of the group passed.
It will not operate on the day it is evaulating but the previous days_back.
Should be used with the apply() function in Pandas with groupby function
Args:
group (pandas.DataFrame): Will be passed by pandas
value_col (str): Name of column with value to be averaged by weight
time_col (str): Name of column of with times in them
new_col_name (str): Name of new column to place time weighted average into, default: time_weighted_average
days_back (str): Time delta description as described in panda time deltas documentation, default: -90 days
fill_value (any): The value to fill rows which do not have data in days_back period, default: np.nan
Returns:
(pandas.DataFrame): The modified DataFrame with time weighted average added to columns, np.nan if no
time weight average exist
"""
for idx, row in group.iterrows():
# Filter for only values that are days_back for averaging.
days_back_fil = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + pd.Timedelta(days_back))
df = group[days_back_fil]
df['days-back'] = (row[time_col] - df[time_col]) / np.timedelta64(1, 'D') # need to divide by np.timedelta day to get number back
df['weight'] = df[value_col] * df['days-back']
try:
df['tw_avg'] = df['weight'].sum() / df['days-back'].sum()
time_avg = df['tw_avg'].iloc[0] # Get single value of the tw_avg
group.loc[idx, new_col_name] = time_avg
except ZeroDivisionError:
group.loc[idx, new_col_name] = fill_value
return group
然后,您可以使用此行返回您正在寻找的DataFrame
df_test.groupby(by=['Cus_ID']).apply(tw_avg, 'Value', 'Date')
这将产生,
Cus_ID Date Value time_weighted_average
0 T313 2015-10-18 3.0 NaN
1 T348 2015-11-14 2.0 NaN
2 T313 2015-11-18 3.0 3.0
3 T348 2015-12-13 4.0 2.0
4 T313 2015-12-19 5.0 3.0
5 T348 2016-01-24 3.0 2.743362831858407
6 T329 2016-01-31 7.25 NaN
7 T329 2016-02-17 10.25 7.25
8 T348 2016-03-28 4.5 3.0
9 T313 2016-03-31 11.75 NaN
10 T329 2016-04-03 6.25 8.51605504587156
11 T348 2016-04-16 6.0 3.2794117647058822
您现在可以使用该函数将加权平均值应用于具有value_col
参数的其他值列,或者使用days_back
参数更改时间窗口长度。有关如何描述时间增量的信息,请参阅pandas time deltas页面。