Question

If we run the following code:

import numpy as np
import pandas as pd

np.random.seed(0)
features = ['f1','f2','f3']

df = pd.DataFrame(np.random.rand(5000,4), columns=features+['target'])
# Bin each feature into three levels (0, 1, 2)
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13,0.66])
# Binarize the target once, outside the loop
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)

df.groupby(features)['target'].agg(['mean','count']).head(9)

we get the mean of the target for each combination of feature values:

            mean    count
f1  f2  f3      
0   0   0   0.571429    7
        1   0.414634    41
        2   0.428571    28
    1   0   0.490909    55
        1   0.467337    199
        2   0.486726    113
    2   0   0.518519    27
        1   0.446281    121
        2   0.541667    72

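In the table above, some groups have too few observations, and I want to merge them into an "adjacent" group according to some rule. For example, I may want to merge group [0,0,0] into group [0,0,1] because it has no more than 30 observations. Is there a good way to combine groups like this based on column values, without creating a separate dictionary? More specifically, I may want to merge the group with the smallest count into its adjacent group (the next group in index order), and repeat until there are no more than 10 groups in total.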

Answer 1

A simple way is to use a for loop over the indexes that meet your condition:

df_group = df.groupby(features)['target'].agg(['mean','count'])
# First reset_index to make the rows easier to manipulate
df_group = df_group.reset_index()
threshold = 58  # put any value you want
list_indexes = df_group[df_group['count'] <= threshold].index.values
# loop over list_indexes
for ind in list_indexes:
    # check your condition again, in case merging a row at a previous
    # iteration has raised this count above your criterion;
    # the last row has no next row to merge into, so it is skipped
    if ind + 1 in df_group.index and df_group.loc[ind, 'count'] <= threshold:
        # add the count values to the next row
        # (a single .loc call avoids chained assignment)
        df_group.loc[ind + 1, 'count'] += df_group.loc[ind, 'count']
        # do anything you want on mean
        # drop the row
        df_group = df_group.drop(index=ind)
# Restore the feature columns as the index
df_group = df_group.set_index(features)
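
The loop above leaves the mean untouched and stops on a count threshold rather than on the number of groups. Here is a minimal sketch of the other rule from the question (merge the smallest group into the next one until at most 10 groups remain), which also updates the mean as a count-weighted average. The helper name merge_smallest_groups and its max_groups parameter are my own, not part of pandas. I am also assuming that the last row, which has no next row, should merge into the previous one. The merged row keeps the feature labels of the group it was merged into.

```python
import numpy as np
import pandas as pd

def merge_smallest_groups(df_group, max_groups=10):
    """Hypothetical helper: repeatedly merge the smallest group into the
    next row (in index order) until at most max_groups rows remain.
    Expects 'mean' and 'count' columns."""
    out = df_group.reset_index()
    # never try to shrink below a single row
    while len(out) > max(max_groups, 1):
        # position of the row with the smallest count (first one on ties)
        pos = int(out['count'].to_numpy().argmin())
        # assumption: the last row merges backwards, since it has no next row
        tgt = pos + 1 if pos + 1 < len(out) else pos - 1
        n_src, n_tgt = out.at[pos, 'count'], out.at[tgt, 'count']
        # count-weighted mean of the two groups
        out.at[tgt, 'mean'] = (out.at[pos, 'mean'] * n_src
                               + out.at[tgt, 'mean'] * n_tgt) / (n_src + n_tgt)
        out.at[tgt, 'count'] = n_src + n_tgt
        # drop the merged row and renumber so labels match positions again
        out = out.drop(index=pos).reset_index(drop=True)
    return out

# Same data as in the question
np.random.seed(0)
features = ['f1', 'f2', 'f3']
df = pd.DataFrame(np.random.rand(5000, 4), columns=features + ['target'])
for f in features:
    df[f] = np.digitize(df[f], bins=[0.13, 0.66])
df['target'] = np.digitize(df['target'], bins=[0.5]).astype(float)

df_group = df.groupby(features)['target'].agg(['mean', 'count'])
merged = merge_smallest_groups(df_group, max_groups=10).set_index(features)
print(merged)
```

Because the means are weighted by count, the overall mean of the target is preserved by the merging. This rescans the counts on every pass, which is fine for a few dozen groups but would be slow for very many.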

Merging subgroups into adjacent subgroups after groupby
