From a DataFrame, I am trying to use the 'mean' column to split the values into 3 bins.
My data is:
   continent      num_countries  mean
0  Europe         25             161.572326
1  Asia           7              607.983830
2  North America  3              1560.438095
3  South America  2              199.148901
4  Australia      1              218.021429
5  Africa         1              213.846154
6  Oceania        1              39.378571
My bins are:
bins = [-np.inf, (in_order['mean'].mean()-in_order['mean'].std()), (in_order['mean'].mean()+in_order['mean'].std()), np.inf]
which results in [-inf, -100.38831237389581, 957.64239998696303, inf].
Then, when I try to put the values into the bins, this happens.
Answer 0 (score: 2)
Starting from your data:
print(df)
continent num_countries mean
0 Europe 25 161.572326
1 Asia 7 607.983830
2 North America 3 1560.438095
3 South America 2 199.148901
4 Australia 1 218.021429
5 Africa 1 213.846154
6 Oceania 1 39.378571
I think the main problem is the way you are referencing the mean column. Note that mean is also a method on pd.DataFrame objects. Observe:
print(df.mean)
<bound method DataFrame.mean of ....>
If you want to access the mean column (rather than the mean method), you need to use df['mean'].
s = pd.cut(in_order['mean'], bins)
print(s)
0 (-100.388, 957.642]
1 (-100.388, 957.642]
2 (957.642, inf]
3 (-100.388, 957.642]
4 (-100.388, 957.642]
5 (-100.388, 957.642]
6 (-100.388, 957.642]
Name: mean, dtype: category
Categories (3, interval[float64]): [(-inf, -100.388] < (-100.388, 957.642] < (957.642, inf]]
print(s.cat.codes)
0 1
1 1
2 2
3 1
4 1
5 1
6 1
dtype: int8
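The difference between attribute access and bracket access can be checked in a self-contained way. The following is a minimal sketch that rebuilds the table from the question (the names df, col, and codes are chosen here for illustration and are not from the original post) and bins the mean column at one standard deviation on either side of its mean.

```python
import numpy as np
import pandas as pd

# Rebuild the question's table; the values are copied from the post.
df = pd.DataFrame({
    "continent": ["Europe", "Asia", "North America", "South America",
                  "Australia", "Africa", "Oceania"],
    "num_countries": [25, 7, 3, 2, 1, 1, 1],
    "mean": [161.572326, 607.983830, 1560.438095, 199.148901,
             218.021429, 213.846154, 39.378571],
})

# Attribute access resolves to the DataFrame.mean method, not the column.
print(callable(df.mean))

# Bracket access always returns the column as a Series.
col = df["mean"]

# Edges one sample standard deviation (pandas default, ddof=1) around the mean.
bins = [-np.inf, col.mean() - col.std(), col.mean() + col.std(), np.inf]

s = pd.cut(col, bins)
codes = s.cat.codes.tolist()
print(codes)
```

Only North America lies above the upper edge, so it is the only row that lands in the last bin.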
Alternatively, have you considered pd.qcut? You can simply pass the number of bins, and your data will be split into that many quantiles.
s = pd.qcut(df['mean'], 4)
print(s)
0 (39.378, 180.361]
1 (413.003, 1560.438]
2 (413.003, 1560.438]
3 (180.361, 213.846]
4 (213.846, 413.003]
5 (180.361, 213.846]
6 (39.378, 180.361]
Name: mean, dtype: category
Categories (4, interval[float64]): [(39.378, 180.361] < (180.361, 213.846] < (213.846, 413.003] <
(413.003, 1560.438]]
print(s.cat.codes)
0 0
1 3
2 3
3 1
4 2
5 1
6 0
dtype: int8
Your approach above puts most of the data into a single category, so I think this would work better for you.
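For anyone who wants to reproduce the quantile-based split without the rest of the DataFrame, here is a minimal sketch using only the mean values from the question (the names means, quartile_codes, and direct_codes are illustrative). It also shows labels=False, which returns the integer bin index directly instead of interval categories.

```python
import pandas as pd

# The 'mean' column from the question, as a standalone Series.
means = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)

# q=4 asks for quartiles; the bin edges are computed from the data itself.
quartiles = pd.qcut(means, 4)
quartile_codes = quartiles.cat.codes.tolist()
print(quartile_codes)

# labels=False skips the interval categories and returns the bin index.
direct_codes = pd.qcut(means, 4, labels=False).tolist()
print(direct_codes)
```

Because the edges follow the data's own quantiles, the rows are spread across all four bins rather than concentrated in one.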
Answer 1 (score: 1)
I would use np.searchsorted:
x = in_order['mean'].values
sig = x.std()
mu = x.mean()
in_order.assign(bins=np.searchsorted([mu - sig, mu + sig], x))
continent num_countries mean bins
0 Europe 25 161.572326 1
1 Asia 7 607.983830 1
2 North America 3 1560.438095 2
3 South America 2 199.148901 1
4 Australia 1 218.021429 1
5 Africa 1 213.846154 1
6 Oceania 1 39.378571 1
If you prefer, we can use labels:
x = in_order['mean'].values
sig = x.std()
mu = x.mean()
labels = np.array(['< μ - σ', 'μ ± σ', '> μ + σ'])
in_order.assign(bins=labels[np.searchsorted([mu - sig, mu + sig], x)])
continent num_countries mean bins
0 Europe 25 161.572326 μ ± σ
1 Asia 7 607.983830 μ ± σ
2 North America 3 1560.438095 > μ + σ
3 South America 2 199.148901 μ ± σ
4 Australia 1 218.021429 μ ± σ
5 Africa 1 213.846154 μ ± σ
6 Oceania 1 39.378571 μ ± σ
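One detail is worth checking when comparing this with the pd.cut approach: x.std() on a NumPy array defaults to the population standard deviation (ddof=0), while Series.std() defaults to the sample standard deviation (ddof=1), so the two approaches place the edges at slightly different values. The sketch below, with variable names chosen here for illustration, computes both sets of edges; for this data the resulting bin indices turn out to be the same either way.

```python
import numpy as np
import pandas as pd

# The 'mean' column from the question, as a standalone Series.
means = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)
x = means.to_numpy()

# NumPy default: population standard deviation (ddof=0).
edges_np = [x.mean() - x.std(), x.mean() + x.std()]

# pandas default: sample standard deviation (ddof=1).
edges_pd = [means.mean() - means.std(), means.mean() + means.std()]

# With the default side="left", searchsorted counts the edges strictly below
# each value: 0 = below the lower edge, 1 = between, 2 = above the upper edge.
idx_np = np.searchsorted(edges_np, x).tolist()
idx_pd = np.searchsorted(edges_pd, x).tolist()
print(edges_np, edges_pd)
print(idx_np, idx_pd)
```

If the bins must match the edges from the question exactly, passing ddof=1 to x.std() makes the NumPy calculation agree with pandas.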