Putting a data frame into bins

Time: 2017-10-08 21:57:09

Tags: python pandas

From a data frame, I'm trying to use the 'mean' column to separate the values into 3 bins.

                     num_countries         mean
0         'Europe',             25   161.572326
1           'Asia',              7   607.983830
2  'North America',              3  1560.438095
3  'South America',              2   199.148901
4      'Australia',              1   218.021429
5          'Africa'              1   213.846154
6        'Oceania',              1    39.378571

My bins are

bins = [-np.inf, (in_order['mean'].mean()-in_order['mean'].std()), (in_order['mean'].mean()+in_order['mean'].std()), np.inf]

which results in [-inf, -100.38831237389581, 957.64239998696303, inf]

Then, when I try to put the values into the bins, this happens.

2 Answers:

Answer 0: (score: 2)

Starting with your data:

print(df)
       continent  num_countries         mean
0         Europe             25   161.572326
1           Asia              7   607.983830
2  North America              3  1560.438095
3  South America              2   199.148901
4      Australia              1   218.021429
5         Africa              1   213.846154
6        Oceania              1    39.378571

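I think the main issue is how you reference the mean column. Note that mean is also a method on pd.DataFrame objects, so attribute access returns the method rather than the column. Observe: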

print(df.mean)
<bound method DataFrame.mean of ....>

If you want to access the mean column (rather than the mean method), you need to use df['mean'].

s = pd.cut(in_order['mean'], bins)
print(s)
0    (-100.388, 957.642]
1    (-100.388, 957.642]
2         (957.642, inf]
3    (-100.388, 957.642]
4    (-100.388, 957.642]
5    (-100.388, 957.642]
6    (-100.388, 957.642]
Name: mean, dtype: category
Categories (3, interval[float64]): [(-inf, -100.388] < (-100.388, 957.642] < (957.642, inf]]

print(s.cat.codes)
0    1
1    1
2    2
3    1
4    1
5    1
6    1
dtype: int8
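The point above can be checked end to end with a minimal, self-contained sketch. It rebuilds the frame from the values shown in the question (the name `df` follows this answer, and the helper names `attr_is_method`, `col` and `codes` are only illustrative), confirms that attribute access yields a method, and then bins the column with the same mean ± one standard deviation edges.

```python
import numpy as np
import pandas as pd

# Data copied from the question; `df` is the name used in this answer.
df = pd.DataFrame({
    "continent": ["Europe", "Asia", "North America", "South America",
                  "Australia", "Africa", "Oceania"],
    "num_countries": [25, 7, 3, 2, 1, 1, 1],
    "mean": [161.572326, 607.983830, 1560.438095, 199.148901,
             218.021429, 213.846154, 39.378571],
})

# Attribute access finds the DataFrame.mean method, not the column.
attr_is_method = callable(df.mean)

# Bracket access returns the column as a Series.
col = df["mean"]

# Same edges as in the question: one sample standard deviation
# (pandas default, ddof=1) on either side of the column mean.
bins = [-np.inf, col.mean() - col.std(), col.mean() + col.std(), np.inf]

# Integer bin index for each row.
codes = pd.cut(col, bins).cat.codes.tolist()
print(attr_is_method, codes)
```

Only North America lands in the top bin; every other row falls between the two inner edges.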

Alternatively, have you considered pd.qcut? You can simply pass the number of bins, and your data will be split into that many quantiles.

s = pd.qcut(df['mean'], 4)
print(s)
0      (39.378, 180.361]
1    (413.003, 1560.438]
2    (413.003, 1560.438]
3     (180.361, 213.846]
4     (213.846, 413.003]
5     (180.361, 213.846]
6      (39.378, 180.361]
Name: mean, dtype: category
Categories (4, interval[float64]): [(39.378, 180.361] < (180.361, 213.846] < (213.846, 413.003] <
                                    (413.003, 1560.438]]

print(s.cat.codes)
0    0
1    3
2    3
3    1
4    2
5    1
6    0
dtype: int8

Your approach above puts most of the data into a single bin, so I think this would work better for you.
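If only the integer bin index is needed, a possible shortcut is to pass `labels=False` to `pd.qcut`, which skips the interval categories. The sketch below assumes the same seven values as above; the names `means`, `quartile` and `counts` are illustrative.

```python
import pandas as pd

# The 'mean' column from the question, in row order.
means = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)

# labels=False returns the integer quantile index instead of Interval categories.
quartile = pd.qcut(means, 4, labels=False)

# Number of rows that land in each quartile bin, ordered by bin index.
counts = quartile.value_counts().sort_index()

print(quartile.tolist())
print(counts.tolist())
```

The rows are spread across all four bins here, rather than six of seven sharing one bin.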

Answer 1: (score: 1)

I'd use np.searchsorted

x = in_order['mean'].values
sig = x.std()
mu = x.mean()

in_order.assign(bins=np.searchsorted([mu - sig, mu + sig], x))

       continent  num_countries         mean  bins
0         Europe             25   161.572326     1
1           Asia              7   607.983830     1
2  North America              3  1560.438095     2
3  South America              2   199.148901     1
4      Australia              1   218.021429     1
5         Africa              1   213.846154     1
6        Oceania              1    39.378571     1
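One detail worth keeping in mind: `x.std()` on a NumPy array uses `ddof=0` by default, while `Series.std()` in pandas uses `ddof=1`, so the edges here differ slightly from those in the question. The following sketch (variable names are illustrative) compares the two and shows that, for this data, the bin assignment comes out the same either way.

```python
import numpy as np
import pandas as pd

means = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)
x = means.to_numpy()
mu = x.mean()

sig_numpy = x.std()       # NumPy default: ddof=0 (population standard deviation)
sig_pandas = means.std()  # pandas default: ddof=1 (sample standard deviation)

# searchsorted uses side='left' by default, so a value equal to an edge
# goes to the lower bin, matching pd.cut's right-closed intervals.
bins_numpy = np.searchsorted([mu - sig_numpy, mu + sig_numpy], x)
bins_pandas = np.searchsorted([mu - sig_pandas, mu + sig_pandas], x)

print(sig_numpy, sig_pandas)
print(bins_numpy.tolist(), bins_pandas.tolist())
```

With other data, a value sitting between the two sets of edges would be binned differently, so it may be safer to pass `ddof` explicitly.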

We can use labels if you'd like:

x = in_order['mean'].values
sig = x.std()
mu = x.mean()

labels = np.array(['< μ - σ', 'μ ± σ', '> μ + σ'])

in_order.assign(bins=labels[np.searchsorted([mu - sig, mu + sig], x)])
       continent  num_countries         mean     bins
0         Europe             25   161.572326    μ ± σ
1           Asia              7   607.983830    μ ± σ
2  North America              3  1560.438095  > μ + σ
3  South America              2   199.148901    μ ± σ
4      Australia              1   218.021429    μ ± σ
5         Africa              1   213.846154    μ ± σ
6        Oceania              1    39.378571    μ ± σ
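For comparison, `pd.cut` accepts a `labels` argument that should give the same strings without indexing into a label array. This is a sketch under the assumption that the edges use `ddof=0` to match the NumPy `x.std()` call above; the names `means` and `binned` are illustrative.

```python
import numpy as np
import pandas as pd

means = pd.Series(
    [161.572326, 607.983830, 1560.438095, 199.148901,
     218.021429, 213.846154, 39.378571],
    name="mean",
)

# ddof=0 mirrors NumPy's default so the edges match the answer above.
mu, sig = means.mean(), means.std(ddof=0)

labels = ["< μ - σ", "μ ± σ", "> μ + σ"]

# One label per interval: three labels for four bin edges.
binned = pd.cut(means, [-np.inf, mu - sig, mu + sig, np.inf], labels=labels)

print(binned.astype(str).tolist())
```

The result is an ordered categorical, which can be convenient for sorting or grouping by bin.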