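For a Pandas DataFrame, I am looking for a vectorised way to calculate, per group, the cumulative sum of views while excluding views that are more than a week old. I have tried all kinds of apply functions, but I cannot find a way to look back 7 days from each row to collect the data I need.
I have a function that works on a small amount of data, but because it is a loop it takes far too long on the full data set. There are more than 2,500 groups, each with about 100 dates filled in, for more than 250,000 records in total.
I have looked at using shift, but because not every group has every date filled in, that does not work. I have also tried a map function, which also takes too long.
The Pandas DataFrame I have is this:
     GROUP         DAY  VIEWS  VIEWS_CUM
165      1  2011-09-18     82         82
166      1  2011-09-19     15         97
167      1  2011-12-21     29        126
168      1  2011-12-22     15        141
169      1  2011-12-23      2        143
170      2  2012-01-07     51         51
171      2  2012-01-08     10         61
172      2  2012-01-09     11         72
173      2  2012-01-17     33        105
174      2  2012-01-18     29        134
175      2  2012-01-19      6        140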
I would like to end up with something like this:
     GROUP         DAY  VIEWS  VIEWS_CUM  VIEWS_CUM_BEFORE
165      1  2011-09-18     82         82                 0
166      1  2011-09-19     15         97                 0
167      1  2011-12-21     29        126                29
168      1  2011-12-22     15        141                44
169      1  2011-12-23      2        143                46
170      2  2012-01-07     51         51                 0
171      2  2012-01-08     10         61                 0
172      2  2012-01-09     11         72                 0
173      2  2012-01-17     33        105                33
174      2  2012-01-18     29        134                62
175      2  2012-01-19      6        140                68
The function below seems to work, but it is far too slow:
import pandas as pd
from pandas.tseries.offsets import Day

# Dict with data
data = {'DAY': {0: '09-18-11', 1: '09-19-11', 2: '12-21-11', 3: '12-22-11', 4: '12-23-11', 5: '01-07-12', 6: '01-08-12', 7: '01-09-12', 8: '01-17-12', 9: '01-18-12', 10: '01-19-12'}, 'GROUP': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2}, 'VIEWS': {0: 82, 1: 15, 2: 29, 3: 15, 4: 2, 5: 51, 6: 10, 7: 11, 8: 33, 9: 29, 10: 6}, 'VIEWS_CUM': {0: 82, 1: 97, 2: 126, 3: 141, 4: 143, 5: 51, 6: 61, 7: 72, 8: 105, 9: 134, 10: 140}}

# Convert dict to pandas DataFrame
df = pd.DataFrame.from_dict(data)

# Make sure the DAY column is datetime (the strings are month-day-year)
df['DAY'] = pd.to_datetime(df['DAY'], format='%m-%d-%y')

# Sort by GROUP and DAY (DataFrame.sort was removed; sort_values replaces it)
df = df.sort_values(['GROUP', 'DAY'])

# Default setting for VIEWS_CUM_BEFORE (float, because the loop may store NaN)
df['VIEWS_CUM_BEFORE'] = 0.0

# Loop to add VIEWS_CUM_BEFORE
for index, row in df.iterrows():
    # Largest VIEWS_CUM among rows of the same group dated at least 7 days earlier
    views_cum_before_max = df.loc[(row['GROUP'] == df['GROUP']) &
                                  (row['DAY'] >= df['DAY'] + Day(7)), 'VIEWS_CUM'].max()
    # .ix was removed from pandas; .loc does the same label-based assignment
    df.loc[index, 'VIEWS_CUM_BEFORE'] = row['VIEWS_CUM'] - views_cum_before_max

# If VIEWS_CUM_BEFORE is empty (no row at least 7 days earlier), make it 0
df['VIEWS_CUM_BEFORE'] = df['VIEWS_CUM_BEFORE'].fillna(0).astype(int)

# Show result
print(df)
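Answer 0 (score: 0)
If the time intervals are all the same size, you can do something like this:
import pandas as pd

def process_group(views):
    # rolling_sum from pandas.stats.moments was removed; Series.rolling replaces it.
    # window is a row count, not a time span. The value 7 * 7 is kept from the
    # original answer; set it to the number of rows that span 7 days in your data.
    return views.rolling(window=7 * 7, min_periods=1).sum()

# Sort chronologically within each group, then apply the rolling sum per group
df = df.sort_values(['GROUP', 'DAY'])
df['VIEWS_CUM_BEFORE'] = df.groupby('GROUP')['VIEWS'].transform(process_group)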
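When the dates are not evenly spaced, as in the question where groups skip days, a row-count window no longer corresponds to a 7-day span. A minimal sketch of one alternative, assuming a pandas version that supports time-based rolling windows on a DatetimeIndex: the offset window '7D' covers the interval (DAY - 7 days, DAY] for each row. The helper rolling_week and the column name VIEWS_LAST_7D are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with ISO-formatted dates
df = pd.DataFrame({
    "GROUP": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "DAY": pd.to_datetime([
        "2011-09-18", "2011-09-19", "2011-12-21", "2011-12-22", "2011-12-23",
        "2012-01-07", "2012-01-08", "2012-01-09", "2012-01-17", "2012-01-18",
        "2012-01-19",
    ]),
    "VIEWS": [82, 15, 29, 15, 2, 51, 10, 11, 33, 29, 6],
})

# Sort so that every group is contiguous and chronologically ordered
df = df.sort_values(["GROUP", "DAY"]).reset_index(drop=True)

def rolling_week(grp):
    # Hypothetical helper: sum of VIEWS over the trailing 7 days of one group
    return grp.set_index("DAY")["VIEWS"].rolling("7D").sum().to_numpy()

# groupby iterates groups in sorted key order, matching the row order of df
parts = [rolling_week(grp) for _, grp in df.groupby("GROUP")]
df["VIEWS_LAST_7D"] = np.concatenate(parts).astype("int64")
print(df)
```

On the sample data this yields 82, 97, 29, 44, 46 for group 1 and 51, 61, 72, 33, 62, 68 for group 2. Unlike the desired output in the question, rows within the first 7 days of a group are not set to 0, so that step would still have to be applied separately.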
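Answer 1 (score: 0)
I group the data into 7-day bins and store the cumulative sum in the VIEWS_CUM_BEFORE column.
df = df.drop(['VIEWS_CUM'], axis=1)
# Cumulative sum of VIEWS within each (7-day bin, GROUP) pair
df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D', key='DAY'), 'GROUP'])['VIEWS'].cumsum()
# The original answer also listed these spellings of the same step
# (the second one needs "import numpy as np"):
# df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D', key='DAY'), 'GROUP']).cumsum()
# df['VIEWS_CUM_BEFORE'] = df.groupby([pd.Grouper(freq='7D', key='DAY'), 'GROUP'])['VIEWS'].apply(np.cumsum)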
But the cumsum is also computed for the first rows of each group, where the desired value is 0.
    GROUP         DAY  VIEWS  VIEWS_CUM_BEFORE
0       1  2011-09-18     82                82
1       1  2011-09-19     15                97
2       1  2011-12-21     29                29
3       1  2011-12-22     15                44
4       1  2011-12-23      2                46
5       2  2012-01-07     51                51
6       2  2012-01-08     10                10
7       2  2012-01-09     11                21
8       2  2012-01-17     33                33
9       2  2012-01-18     29                62
10      2  2012-01-19      6                68
So we have to find the minimum DAY of each group, add 7 days to it, and then set VIEWS_CUM_BEFORE to 0 on every row whose DAY is earlier than that date.
def repeat_value(grp):
    # Earliest DAY in the group plus 7 days, repeated on every row of the group
    grp['DAY2'] = grp['DAY'].min() + pd.Timedelta('7 days')
    return grp

# group_keys=False keeps the original index instead of adding GROUP to it
df = df.groupby(['GROUP'], group_keys=False).apply(repeat_value)
print(df)
    GROUP         DAY  VIEWS  VIEWS_CUM_BEFORE        DAY2
0       1  2011-09-18     82                82  2011-09-25
1       1  2011-09-19     15                97  2011-09-25
2       1  2011-12-21     29                29  2011-09-25
3       1  2011-12-22     15                44  2011-09-25
4       1  2011-12-23      2                46  2011-09-25
5       2  2012-01-07     51                51  2012-01-14
6       2  2012-01-08     10                10  2012-01-14
7       2  2012-01-09     11                21  2012-01-14
8       2  2012-01-17     33                33  2012-01-14
9       2  2012-01-18     29                62  2012-01-14
10      2  2012-01-19      6                68  2012-01-14
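The DAY2 column shown above can probably also be built without a Python-level function per group. A minimal sketch, assuming only the GROUP and DAY columns matter for this step (the four-row frame is a cut-down illustration, not the full data set):

```python
import pandas as pd

df = pd.DataFrame({
    "GROUP": [1, 1, 2, 2],
    "DAY": pd.to_datetime(["2011-09-18", "2011-12-21", "2012-01-07", "2012-01-17"]),
})

# transform('min') broadcasts each group's earliest DAY back onto its rows
df["DAY2"] = df.groupby("GROUP")["DAY"].transform("min") + pd.Timedelta(days=7)
print(df)
```

Because transform returns a result aligned with the original index, no regrouping or index handling is needed before the comparison in the next step.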
# Rows dated before the group's first day plus 7 days get 0
df.loc[df['DAY2'] > df['DAY'], 'VIEWS_CUM_BEFORE'] = 0
del df['DAY2']
print(df)
    GROUP         DAY  VIEWS  VIEWS_CUM_BEFORE
0       1  2011-09-18     82                 0
1       1  2011-09-19     15                 0
2       1  2011-12-21     29                29
3       1  2011-12-22     15                44
4       1  2011-12-23      2                46
5       2  2012-01-07     51                 0
6       2  2012-01-08     10                 0
7       2  2012-01-09     11                 0
8       2  2012-01-17     33                33
9       2  2012-01-18     29                62
10      2  2012-01-19      6                68
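One caveat: pd.Grouper(freq='7D') assigns rows to fixed 7-day bins rather than to a window trailing each row, so it can differ from the loop in the question when a group's dates straddle a bin edge. The sketch below is one way to mirror the loop's rule directly (subtract the cumulative total of the rows dated at least 7 days earlier, or return 0 when there is no such row). It is not part of the original answer. The function name views_cum_before is hypothetical, and the code assumes VIEWS is never negative, so that the largest earlier VIEWS_CUM is simply the most recent one.

```python
import numpy as np
import pandas as pd

def views_cum_before(df, days=7):
    """Hypothetical helper reproducing the question's loop with NumPy lookups."""
    df = df.sort_values(["GROUP", "DAY"]).reset_index(drop=True)
    result = np.zeros(len(df), dtype="int64")
    window = np.timedelta64(days, "D")
    for _, grp in df.groupby("GROUP"):
        day = grp["DAY"].to_numpy()
        cum = grp["VIEWS"].to_numpy().cumsum()
        # For each row: how many rows of the group are dated <= DAY - `days` days
        n_old = np.searchsorted(day, day - window, side="right")
        # Cumulative views at the last such row (index clamped when none exist)
        old_cum = cum[np.maximum(n_old, 1) - 1]
        # Rows without any sufficiently old row get 0, as in the question
        result[grp.index.to_numpy()] = np.where(n_old > 0, cum - old_cum, 0)
    df["VIEWS_CUM_BEFORE"] = result
    return df

sample = pd.DataFrame({
    "GROUP": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "DAY": pd.to_datetime([
        "2011-09-18", "2011-09-19", "2011-12-21", "2011-12-22", "2011-12-23",
        "2012-01-07", "2012-01-08", "2012-01-09", "2012-01-17", "2012-01-18",
        "2012-01-19",
    ]),
    "VIEWS": [82, 15, 29, 15, 2, 51, 10, 11, 33, 29, 6],
})
out = views_cum_before(sample)
print(out)
```

The Python loop here runs once per group rather than once per row, and the work inside each group is done by NumPy, which should matter with roughly 2,500 groups of about 100 rows each.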