Question

I want to split a data frame into N groups so that the mean of each group's values is close to the overall mean.

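This is the head of my data frame:

cluster_map

It has 61 rows. I want the mean of the 'cluster' column in each group to be similar to that of the other groups.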

I tried splitting the data frame with:

df_out = np.array_split(cluster_map, 14)

But this is the output I get:

import numpy as np

# Split the 61 rows into 14 consecutive chunks and print each chunk's mean.
df_out = np.array_split(cluster_map, 14)
for chunk in df_out:
    print(chunk['cluster'].mean())

[Out]
    1.2
    1.6
    1.4
    1.0
    1.2
    1.5
    3.75
    0.5
    1.25
    2.0
    1.0
    2.25
    1.0
    1.0

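As you can see, the means of the 'cluster' column are unbalanced. I would like these values to be as close to each other as possible, while keeping a similar number of elements in each group.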

Is there any way to do this with a data frame?

Thanks :)

Answer 1

This looks similar to a stratified split, except that you need 14 splits. Give this a try!

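from sklearn.model_selection import StratifiedKFold

# Each of the 14 test folds becomes one group, stratified on 'cluster'.
kf = StratifiedKFold(n_splits=14)

cluster_map['group_id'] = 0
group_col = cluster_map.columns.get_loc('group_id')

# split() yields positional indices, so assign with iloc rather than loc.
for group_id, (_, test_index) in enumerate(kf.split(cluster_map, cluster_map['cluster'])):
    cluster_map.iloc[test_index, group_col] = group_id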
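
If scikit-learn is not available, a rough alternative (not the stratified approach above, just a sketch) is to sort the rows by 'cluster' and deal them out in serpentine order, then check the balance with a groupby. The helper name assign_snake_groups and the random demo frame are made up for illustration; the real cluster_map would take the place of demo.

```python
import numpy as np
import pandas as pd


def assign_snake_groups(df, column, n_groups):
    """Return a copy of df with a 'group_id' column (hypothetical helper).

    Rows are ranked by `column` and dealt out in serpentine order
    (0..n-1, then n-1..0, and so on), so group sizes differ by at most one.
    """
    # Positions of the rows in ascending order of `column`.
    order = np.argsort(df[column].to_numpy(), kind="stable")
    lap, slot = np.divmod(np.arange(len(df)), n_groups)
    # Reverse the dealing direction on every odd lap.
    ids = np.where(lap % 2 == 0, slot, n_groups - 1 - slot)
    group_id = np.empty(len(df), dtype=int)
    group_id[order] = ids
    out = df.copy()
    out["group_id"] = group_id
    return out


# Made-up stand-in for cluster_map: 61 rows, cluster values 0..4.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"cluster": rng.integers(0, 5, size=61)})

grouped = assign_snake_groups(demo, "cluster", 14)
# Per-group size and mean, to see how balanced the result is.
summary = grouped.groupby("group_id")["cluster"].agg(["size", "mean"])
print(summary)
```

The same groupby check works on the group_id column produced by the StratifiedKFold loop. Because 61 is not a multiple of 14, some groups get one extra row, so the means will not match exactly with either approach.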
