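I loop over a large number of unique IDs as shown below, iterating over the current and all previous visits for each ID to build summary statistics. This works for small amounts of data, but on larger datasets the code can take a very long time to run. Is there a faster way to solve this (without using multiprocessing)?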
import pandas as pd

d = {
    'id': ['A', 'B', 'B', 'C'],
    'visit_id': ['asd', 'awd', 'qdw', 'qwb'],
    'value': [-343.68, 343.68, -55.2, 55.2]}
df = pd.DataFrame(data=d)

agg_users = pd.DataFrame()
for i in df['id'].unique():
    user_tbl = df.loc[df['id']==i]
    user_tbl.insert(0, 'visit_sequence', range(0, 0 + len(user_tbl)))
    agg_sessions = pd.DataFrame()
    for i in user_tbl['visit_sequence']:
        tmp = user_tbl.loc[user_tbl['visit_sequence'] <= i]
        ses = tmp.loc[user_tbl['visit_sequence'] == i, 'visit_id'].item()
        aggs = {
            'value': ['min', 'max', 'mean']
        }
        tmp2 = tmp.groupby('id').agg(aggs)
        new_columns = [k + '_' + agg for k in aggs.keys() for agg in aggs[k]]
        tmp2.columns = new_columns
        tmp2.reset_index(inplace=True)
        tmp2.insert(1, 'visit_id', ses)
        agg_sessions = pd.concat([agg_sessions, tmp2])
    agg_users = pd.concat([agg_users, agg_sessions])
agg_users
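Answer 0 (score: 1)
Based on the output of your code, I think you are looking for expanding-window aggregations; see the docs.
The solution below is somewhat clunky because of a pandas bug in df.groupby('colname').expanding().agg(), documented in this GitHub issue.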
# First, sort by id, then visit_id before grouping by id.
# Pandas groupby preserves the order of rows within each group:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
df.sort_values(['id', 'visit_id'], inplace=True)
# Calculate expanding-window aggregations for each id
aggmin = df.groupby('id').expanding()['value'].min().to_frame(name='value_min')
aggmax = df.groupby('id').expanding()['value'].max().to_frame(name='value_max')
aggmean = df.groupby('id').expanding()['value'].mean().to_frame(name='value_mean')
# Combine the above aggregations, and drop the extra index level
agged = pd.concat([aggmin, aggmax, aggmean], axis=1).reset_index().drop('level_1', axis=1)
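# Bring in the visit ids, which are guaranteed to be in the correct sort order.
# Assign by position: agged has a fresh 0..n-1 index, while df keeps its
# original labels after sorting, so label-based alignment could scramble rows.
agged['visit_id'] = df['visit_id'].to_numpy()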
# Rearrange columns
agged = agged[['id', 'visit_id', 'value_min', 'value_max', 'value_mean']]
agged
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     343.68      144.24
3  C      qwb      55.20      55.20       55.20
# Output of your code:
agg_users
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
0  B      awd     343.68     343.68      343.68
0  B      qdw     -55.20     343.68      144.24
0  C      qwb      55.20      55.20       55.20
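As a possible alternative to the expanding-window calls above, the same running statistics can be sketched with the cumulative groupby methods (cummin, cummax, cumsum, cumcount). This is not part of the original answer; it assumes the rows are already in visit order within each id, and the name result is only illustrative. Because these methods return values aligned to the original index, no re-attachment of visit_id is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "B", "C"],
    "visit_id": ["asd", "awd", "qdw", "qwb"],
    "value": [-343.68, 343.68, -55.2, 55.2],
})

# Assumption: rows are already ordered by visit within each id.
grouped = df.groupby("id")["value"]

result = df[["id", "visit_id"]].copy()
result["value_min"] = grouped.cummin()   # running minimum per id
result["value_max"] = grouped.cummax()   # running maximum per id
# Running mean = running sum / number of visits seen so far (1-based).
result["value_mean"] = grouped.cumsum() / (df.groupby("id").cumcount() + 1)

print(result)
```

If the visits are not yet in order, sorting the frame first (for example by a timestamp column, if one exists) would be needed before taking the cumulative statistics.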
Answer 1 (score: 0)
You want to use groupby and agg:
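The step that builds res does not appear in the session transcript that follows. A minimal sketch of how it might have been computed, assuming a groupby on both id and visit_id with the same three aggregations used in the question (this is a guess at the missing step, not the answerer's confirmed code):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "B", "C"],
    "visit_id": ["asd", "awd", "qdw", "qwb"],
    "value": [-343.68, 343.68, -55.2, 55.2],
})

# Assumed reconstruction: one group per (id, visit_id) pair.
# Passing a dict of lists to agg yields two-level (MultiIndex) columns,
# e.g. ('value', 'min'), which is why the columns are renamed afterwards.
res = df.groupby(["id", "visit_id"]).agg({"value": ["min", "max", "mean"]})

print(res.columns.tolist())
```

Note that grouping on both keys puts each visit in its own group, so the statistics are per visit rather than cumulative over earlier visits.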
In [13]: res.columns = ["value_min", "value_max", "value_mean"]

In [14]: res
Out[14]:
             value_min  value_max  value_mean
id visit_id
A  asd         -343.68    -343.68     -343.68
B  awd          343.68     343.68      343.68
   qdw          -55.20     -55.20      -55.20
C  qwb           55.20      55.20       55.20

In [15]: res.reset_index()
Out[15]:
  id visit_id  value_min  value_max  value_mean
0  A      asd    -343.68    -343.68     -343.68
1  B      awd     343.68     343.68      343.68
2  B      qdw     -55.20     -55.20      -55.20
3  C      qwb      55.20      55.20       55.20
To remove the MultiIndex, you can set the columns explicitly (as in In [13] above) to get the same result.