I have a Pandas DataFrame as follows:
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
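What is the best way to calculate the PreviousMean column shown in the screenshot below?
This column is the customer's year-to-date mean of DPD. That is, it includes all of the customer's DPD values up to, but not including, rows that match the current deposit date. If there are no previous records, it is null or 0.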
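To make the definition concrete, here is a minimal plain-Python sketch for Sue's rows only. The helper name previous_mean is hypothetical and only an illustration; it returns None (rather than 0) when there are no earlier records.

```python
from datetime import date

# Sue's rows from the DataFrame above: (deposit date, DPD)
rows = [
    (date(2017, 1, 1), 10),
    (date(2017, 2, 2), 15),
    (date(2017, 3, 2), 20),
    (date(2017, 3, 3), 7),
    (date(2017, 4, 4), 20),
]


def previous_mean(rows, current_date):
    """Mean of DPD over rows dated strictly before current_date, or None."""
    earlier = [dpd for d, dpd in rows if d < current_date]
    return sum(earlier) / len(earlier) if earlier else None


expected = [previous_mean(rows, d) for d, _ in rows]
print(expected)  # [None, 10.0, 12.5, 15.0, 13.0]
```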
Answer 0: (score: 0)
Here is one way to exclude repeated days from the mean calculation:
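import numpy as np
import pandas as pd

# assumes df has the columns 'Customer Name', 'Deposit_Date' and a numeric 'DPD'
# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean has been removed from pandas)
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper series
df = df.drop(columns='DPD2')
print(df)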
  Customer Name Deposit_Date  DPD  CumMean
0          John   01/01/2017   10     10.0
1          John   01/01/2017   10     10.0
2          John   02/02/2017   20     15.0
3          John   03/03/2017   30     20.0
4           Sue   01/01/2017   10     10.0
5           Sue   01/01/2017   10     10.0
6           Sue   02/02/2017   20     15.0
7           Sue   03/03/2017   30     20.0
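Answer 1: (score: 0)
Instead of grouping and taking an expanding mean, filter the DataFrame on the following conditions and calculate the mean of DPD:
Customer == the current row's Customer
Deposit_Date < the current row's Deposit_Date
Use df.apply to do this for every row in the DataFrame: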
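# assumes the question's df: parse the dates (month/day/year) and make DPD numeric first
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'], format='%m/%d/%Y')
df['DPD'] = pd.to_numeric(df['DPD'])
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)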
Output:
  Customer Deposit_Date  DPD  PreviousMean
0     John   2017-01-01   10           NaN
1     John   2017-02-02   15          10.0
2     John   2017-02-02   20          10.0
3     John   2017-03-03   30          15.0
4      Sue   2017-01-01   10           NaN
5      Sue   2017-02-02   15          10.0
6      Sue   2017-03-02   20          12.5
7      Sue   2017-03-03    7          15.0
8      Sue   2017-04-04   20          13.0
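Answer 2: (score: 0)
OK, this is the best solution I have come up with so far.
The trick is to first create an aggregated table at the customer and deposit date level that contains the shifted mean. To calculate this mean, you first have to calculate the sum and the count.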
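The steps described above can be sketched as follows. This is not this answer's original code, which did not survive on this page; it is a minimal sketch assuming the df from the question, with Deposit_Date parsed as month/day/year and DPD converted to numbers. The names agg, dpd_sum, dpd_count, running, prev and result are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([['John', '1/1/2017', '10'],
                   ['John', '2/2/2017', '15'],
                   ['John', '2/2/2017', '20'],
                   ['John', '3/3/2017', '30'],
                   ['Sue', '1/1/2017', '10'],
                   ['Sue', '2/2/2017', '15'],
                   ['Sue', '3/2/2017', '20'],
                   ['Sue', '3/3/2017', '7'],
                   ['Sue', '4/4/2017', '20']],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
# Assumption: dates are month/day/year and DPD should be numeric.
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'], format='%m/%d/%Y')
df['DPD'] = pd.to_numeric(df['DPD'])

# One row per customer and deposit date, holding that day's sum and count.
agg = (df.groupby(['Customer', 'Deposit_Date'])['DPD']
         .agg(dpd_sum='sum', dpd_count='count')
         .reset_index())

# Running totals per customer, then shifted by one date so that the
# current deposit date is excluded from its own mean.
running = agg.groupby('Customer')[['dpd_sum', 'dpd_count']].cumsum()
prev = running.groupby(agg['Customer']).shift(1)
agg['PreviousMean'] = prev['dpd_sum'] / prev['dpd_count']

# Bring the per-date value back onto the original rows.
result = df.merge(agg[['Customer', 'Deposit_Date', 'PreviousMean']],
                  on=['Customer', 'Deposit_Date'], how='left')
print(result)
```

On the sample data this should give the same PreviousMean values as the df.apply approach in Answer 1, while avoiding a filter of the whole DataFrame for every row.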
Answer 3: (score: 0)