If we run the following code:
import numpy as np
import pandas as pd

np.random.seed(0)
features = ['f1', 'f2', 'f3']
df = pd.DataFrame(np.random.rand(5000, 4), columns=features + ['target'])
# bin each feature into three classes (0, 1, 2)
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13, 0.66])
# binarize the target at 0.5
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df.groupby(features)['target'].agg(['mean', 'count']).head(9)
we get the mean of the target for each combination of feature values:
              mean  count
f1 f2 f3
0  0  0   0.571429      7
      1   0.414634     41
      2   0.428571     28
   1  0   0.490909     55
      1   0.467337    199
      2   0.486726    113
   2  0   0.518519     27
      1   0.446281    121
      2   0.541667     72
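In the table above, some groups have too few observations, and I would like to merge them into an "adjacent" group according to some rule. For example, I may want to merge group [0,0,0] into group [0,0,1] because it has no more than 30 observations. Is there a good way to manipulate these combinations based on column values without creating a separate dictionary? More specifically, I may want to merge groups, starting from the one with the smallest count, into the adjacent group (the next group in index order) until the total number of groups is no more than 10.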
Answer 0 (score: 1)
One simple way is to use a for loop over the indexes that meet the condition:
df_group = df.groupby(features)['target'].agg(['mean', 'count'])
# First reset_index to make the rows easier to manipulate
df_group = df_group.reset_index()
list_indexes = df_group[df_group['count'] <= 58].index.values  # put any value you want
# loop over list_indexes
for ind in list_indexes:
    # check the condition again, in case merging a row at a previous
    # iteration has increased the count above your criterion;
    # the last row has no next row to merge into, so it is skipped
    if df_group.loc[ind, 'count'] <= 58 and ind + 1 in df_group.index:
        # add the count value to the next row (single .loc call, no chained assignment)
        df_group.loc[ind + 1, 'count'] += df_group.loc[ind, 'count']
        # do anything you want on mean
        # drop the row
        df_group = df_group.drop(index=ind)
# Reindex your df
df_group = df_group.set_index(features)
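The loop above uses a fixed count threshold and leaves the mean untouched. For the more specific rule in the question (merge the smallest group into the next one in index order until no more than 10 groups remain), a minimal sketch could look like the following. The helper name merge_smallest_groups and its max_groups parameter are hypothetical, not part of pandas. Two assumptions are made here: the merged mean is the count-weighted average of the two means, and a smallest group sitting in the last row is merged into the previous row because it has no next row. The merged row keeps the feature labels of the group that absorbed it.

```python
import numpy as np
import pandas as pd


def merge_smallest_groups(df_group, max_groups=10):
    """Hypothetical helper: merge the smallest group into its neighbour
    until at most max_groups rows remain.

    Expects 'mean' and 'count' columns; the index is reset internally.
    """
    if max_groups < 1:
        raise ValueError("max_groups must be at least 1")
    out = df_group.reset_index(drop=True).copy()
    while len(out) > max_groups:
        # position of the group with the smallest count
        pos = int(out['count'].to_numpy().argmin())
        # assumption: the last row merges backwards, all others merge forwards
        nbr = pos + 1 if pos + 1 < len(out) else pos - 1
        n_pos, n_nbr = out.at[pos, 'count'], out.at[nbr, 'count']
        total = n_pos + n_nbr
        # assumption: combine the means as a count-weighted average
        out.at[nbr, 'mean'] = (out.at[pos, 'mean'] * n_pos
                               + out.at[nbr, 'mean'] * n_nbr) / total
        out.at[nbr, 'count'] = total
        out = out.drop(index=pos).reset_index(drop=True)
    return out


# Usage with the data from the question
np.random.seed(0)
features = ['f1', 'f2', 'f3']
df = pd.DataFrame(np.random.rand(5000, 4), columns=features + ['target'])
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13, 0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)
df_group = df.groupby(features)['target'].agg(['mean', 'count']).reset_index()

merged = merge_smallest_groups(df_group, max_groups=10).set_index(features)
print(merged)
```

Because each merge only moves observations from one row to another, the total count stays at 5000 and the count-weighted mean over all rows stays equal to the overall target mean, which gives a quick sanity check on the result.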