I have a DataFrame df_pct_Max with the following shape:
Date Value1 Value2
01.01.2015 5 6
08.01.2015 3 2
... ... ...
28.01.2017 7 8
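I want to compute the average for each calendar week and subtract it from the actual values in that calendar week.
I created a DataFrame containing the average for each calendar week, like this: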
df_weekly_avg_Max = df_pct_Max.groupby(df_pct_Max.index.isocalendar().week).mean()
This produces the DataFrame df_weekly_avg_Max:
KW Value1 Value2
1 3.5 4.3
2 4 3
… … …
52 8.33 6.2
Now I am trying to subtract df_weekly_avg_Max from df_pct_Max, matching the rows by calendar week.
I tried adding a 'KW' column and then
dfresult = df_pct_Max.sub(df_weekly_avg_Max, axis='KW')
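but I get an error there.
Is there also a way to do this on a rolling basis (starting with week 1 of 2015 and 2016, the average of calendar week 1 over the past 3 years...)? Can anyone help with this?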
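Answer 0 (score: 1)
This answer isn't clean, since it doesn't make good use of pandas, but I don't think it will be slow either (depending on how large your DataFrame is). The basic idea is to build a list of the means, repeated once for each day in the week, so that you can simply subtract.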
CODE:
from collections import Counter
import pandas as pd
import numpy as np
#Build up example data frame
num_days = 15
dates = pd.date_range('1/1/2015', periods=num_days, freq='D')
val1s = np.random.randint(1, 31, num_days) #<-- upper bound is exclusive, so values are 1..30
val2s = np.random.randint(1, 31, num_days)
df_pct_MAX = pd.DataFrame({'Date':dates, 'Value1':val1s, 'Value2':val2s})
df_pct_MAX['Day'] = df_pct_MAX['Date'].dt.day_name()
df_pct_MAX['Week'] = df_pct_MAX['Date'].dt.isocalendar().week
#OP's logic to get means (numeric columns only, so the 'Day' strings are skipped)
df_weekly_avg_Max = df_pct_MAX.groupby('Week').mean(numeric_only=True)
#Build up a list of the means repeated once for each day in that week
mean_fields = ['Value1','Value2'] #<-- only hardcoded portion
means_dict = {k:list(df_weekly_avg_Max[k]) for k in mean_fields} #<-- convert means into lists keyed by field
week_counts = Counter(df_pct_MAX['Week']).values() #<-- count how many days are represented in each week
#Build up a dict keyed by field with the means repeated the correct number of times
means = {k:[means_dict[k][i] for i,count in enumerate(week_counts)
for x in range(count)] for k in mean_fields}
#Assign a new column to the means for each field (not necessary, just to show done correctly)
for k in mean_fields:
    df_pct_MAX[k+'Mean'] = means[k]
print(df_pct_MAX)
Output:
Date Value1 Value2 Day Week Value1Mean Value2Mean
0 2015-01-01 12 19 Thursday 1 9.000000 19.250000
1 2015-01-02 15 27 Friday 1 9.000000 19.250000
2 2015-01-03 2 30 Saturday 1 9.000000 19.250000
3 2015-01-04 7 1 Sunday 1 9.000000 19.250000
4 2015-01-05 6 20 Monday 2 17.571429 14.142857
5 2015-01-06 9 24 Tuesday 2 17.571429 14.142857
6 2015-01-07 25 17 Wednesday 2 17.571429 14.142857
7 2015-01-08 22 8 Thursday 2 17.571429 14.142857
8 2015-01-09 30 7 Friday 2 17.571429 14.142857
9 2015-01-10 10 1 Saturday 2 17.571429 14.142857
10 2015-01-11 21 22 Sunday 2 17.571429 14.142857
11 2015-01-12 23 29 Monday 3 23.750000 19.750000
12 2015-01-13 23 16 Tuesday 3 23.750000 19.750000
13 2015-01-14 21 17 Wednesday 3 23.750000 19.750000
14 2015-01-15 28 17 Thursday 3 23.750000 19.750000
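The code above stops at the mean columns. To finish the subtraction the answer describes, here is one possible sketch. It looks up each row's weekly mean by week number with Series.map, so it does not depend on the rows being sorted by week. The sample values, the variable names, and the 'Diff' column suffix are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: ten consecutive days starting 2015-01-01.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Value1": [12, 15, 2, 7, 6, 9, 25, 22, 30, 10],
    "Value2": [19, 27, 30, 1, 20, 24, 17, 8, 7, 1],
})

# ISO calendar week of each row, cast to a plain integer dtype.
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)

value_cols = ["Value1", "Value2"]
weekly_avg = df.groupby("Week")[value_cols].mean()

for col in value_cols:
    # Look up each row's weekly mean by its week number, then subtract it.
    df[col + "Mean"] = df["Week"].map(weekly_avg[col])
    df[col + "Diff"] = df[col] - df[col + "Mean"]

print(df)
```

Because the lookup is keyed on the week number, every row in calendar week 1 gets the week 1 mean, whichever year it falls in. That matches the single KW table in the question.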
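Answer 1 (score: 1)
I found a solution that works on the whole DataFrame. I added a column 'KW' for the calendar week and then did a groupby on it with a lambda function that subtracts the mean of calendar week "1" from the current values in calendar week "1", and so on for the other weeks.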
df_pct_Max['KW'] = df_pct_Max.index.isocalendar().week
dfresult = df_pct_Max.groupby(by='KW').transform(lambda x: x-x.mean())
This works for me.
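For anyone who wants to try the groupby/transform approach in isolation, here is a minimal self-contained sketch. The four sample rows are made up, and it assumes the dates are in the DataFrame's index, as in the question.

```python
import pandas as pd

# Hypothetical data: calendar weeks 1 and 2 in two different years.
idx = pd.to_datetime(["2015-01-01", "2015-01-08", "2016-01-07", "2016-01-14"])
df = pd.DataFrame(
    {"Value1": [5.0, 3.0, 7.0, 1.0], "Value2": [6.0, 2.0, 8.0, 4.0]},
    index=idx,
)

# ISO calendar week as a plain integer column, assigned by position.
df["KW"] = df.index.isocalendar().week.astype(int).to_numpy()

# Within each calendar week, subtract that week's mean from every value.
result = df.groupby("KW").transform(lambda x: x - x.mean())
print(result)
```

The result keeps the original index and contains only the value columns, because transform leaves out the grouping column.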
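It would be nicer to be able to adjust the time frame of the average, for example subtracting from the current calendar week "1" the average of that calendar week over the past 3 years or so. But that seems fairly complicated, and this solution is sufficient for the current analysis.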
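One way the trailing-years variant might be approached is sketched below. It is not a tested solution for the original data. It assumes daily data with no missing years, uses ISO year and week numbers, and gives each prior year's weekly mean equal weight. The names n_years, Baseline, and Value1Adj are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering several years.
idx = pd.date_range("2013-01-01", "2017-01-28", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Value1": rng.integers(1, 30, len(idx)).astype(float)}, index=idx)

iso = df.index.isocalendar()
df["Year"] = iso["year"].astype(int).to_numpy()
df["KW"] = iso["week"].astype(int).to_numpy()

# Mean per (ISO year, calendar week), laid out as a year x week table.
yearly = df.groupby(["Year", "KW"])["Value1"].mean().unstack("KW")

# For each week, average up to n_years preceding years.
# shift(1) moves every row down one year, so the current year is excluded.
n_years = 3
baseline = yearly.shift(1).rolling(n_years, min_periods=1).mean()

# Bring the baseline back to the daily rows and subtract it.
lookup = baseline.stack().rename("Baseline")
df = df.join(lookup, on=["Year", "KW"])
df["Value1Adj"] = df["Value1"] - df["Baseline"]

print(df.tail())
```

Rows in the first year have no earlier data, so their baseline is NaN. min_periods=1 lets the second and third years fall back to however many prior years exist.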