Based on bin

Time: 2017-08-17 03:31:11

Tags: python pandas

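I am trying to subset a pandas DataFrame based on a categorical bin. (I know you can subset on the values themselves; this is just a simplified representation of a different problem where I actually need to partition the data!) I think I am missing something about subsetting, but I cannot find it in the documentation. Here is an example: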

import numpy as np
import pandas as pd

np.random.seed(9876)

# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)

# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(), 
                           stop = random_data.max() + random_data.max()*0.1, 
                           step = bin_step)

# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(), 
              bin_ranges,
              right = True,
              include_lowest = True)

# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')

df = pd.concat([bins_transformed, random_data_pd], axis = 1)

When I subset on a bin, for example (5.086, 5.586], the comparison returns all False. Why does this not select the matching rows?

df.bins == '(5.086, 5.586]'  # returns all False

1 answer:

Answer 0: (score: 1)

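If I understand correctly, the reason is that you are using == between two different types: pd.Interval vs. str. The values in the bins column are Interval objects, so comparing them to a string never matches. Please check my example.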

print(type(df.bins[0]))

<class 'pandas._libs.interval.Interval'>
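To make the type mismatch concrete, here is a minimal sketch on a small hand-written dataset. The values, the bin edges, and the name df_demo are made up for illustration and are not taken from the question. It assumes a reasonably recent pandas, where comparing a categorical column with == against a scalar that is not one of its categories gives all False.

```python
import pandas as pd

# Hypothetical toy data, chosen so the bin edges are easy to read.
values = [0.2, 0.7, 1.2, 1.4, 1.9]
edges = [0.0, 0.5, 1.0, 1.5, 2.0]

df_demo = pd.DataFrame({
    "bins": pd.cut(values, edges, right=True, include_lowest=True),
    "values": values,
})

# Each category is a pd.Interval object, not a string.
first_category = df_demo["bins"].cat.categories[0]
print(type(first_category).__name__)

# A string is not one of the Interval categories, so this mask is all False.
str_mask = df_demo["bins"] == "(1.0, 1.5]"

# An Interval with the same bounds and the same closed side does match.
interval_mask = df_demo["bins"] == pd.Interval(1.0, 1.5, closed="right")

print(str_mask.tolist())
print(interval_mask.tolist())
```

The string only looks like the bin label; the column stores Interval objects, so the comparison has to be made against an Interval.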

print(df.bins)
print(df.bins == pd.Interval(5.1, 5.2))

0     (1.586, 2.086]
1     (6.086, 6.586]
2     (8.586, 9.086]
3     (7.586, 8.086]
4     (5.086, 5.586]
5     (0.585, 1.086]
6     (4.586, 5.086]
7     (1.086, 1.586]
8     (9.086, 9.586]
9     (4.586, 5.086]
10    (1.586, 2.086]
11    (1.086, 1.586]
12    (2.586, 3.086]
13    (2.586, 3.086]
14    (1.086, 1.586]
15    (8.086, 8.586]
16    (7.086, 7.586]
17    (6.586, 7.086]
18    (8.586, 9.086]
19    (7.586, 8.086]
20    (7.586, 8.086]
21    (0.585, 1.086]
22    (4.586, 5.086]
23    (9.086, 9.586]
24    (8.086, 8.586]
25    (6.586, 7.086]
26    (5.086, 5.586]
27    (6.586, 7.086]
28    (5.086, 5.586]
29    (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
                                     (2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
                                     (9.086, 9.586] < (9.586, 10.086]]
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26     True
27    False
28     True
29    False
Name: bins, dtype: bool

...and subsetting:

print(df[df.bins == pd.Interval(5.1, 5.2)])

              bins    values
4   (5.086, 5.586]  5.132422
26  (5.086, 5.586]  5.309666
28  (5.086, 5.586]  5.574920
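
One caveat: the match on pd.Interval(5.1, 5.2) reflects the pandas version this answer was written with. As far as I know, more recent pandas releases compare an Interval against interval categories by exact equality, so an interval that merely lies inside a bin may no longer match. A sketch that should not depend on that behavior is to look up the bin containing a given value and compare against that exact category. The toy data and the names df_demo, categories, and target_bin are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data with easy-to-read bin edges.
values = [0.2, 0.7, 1.2, 1.4, 1.9]
edges = [0.0, 0.5, 1.0, 1.5, 2.0]

df_demo = pd.DataFrame({
    "bins": pd.cut(values, edges, right=True, include_lowest=True),
    "values": values,
})

# The categories of a pd.cut result form an IntervalIndex.
categories = df_demo["bins"].cat.categories

# get_loc with a scalar returns the position of the bin containing it
# (a single integer here because the bins do not overlap).
target_bin = categories[categories.get_loc(1.2)]

# Comparing against the exact category selects the rows in that bin.
subset = df_demo[df_demo["bins"] == target_bin]
print(subset)
```

Because target_bin is taken from the column's own categories, the comparison does not rely on typing the bin edges by hand.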