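I am trying to subset a pandas DataFrame based on a categorical category. (I know you can subset on the values themselves; this is just a stand-in for a different problem where I actually need to partition the data!) I think I am missing something about subsetting, but I can't find what in the documentation. Here is an example: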
import numpy as np
import pandas as pd
np.random.seed(9876)
# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)
# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(),
                       stop = random_data.max() + random_data.max()*0.1,
                       step = bin_step)
# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(),
              bin_ranges,
              right = True,
              include_lowest = True)
# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')
df = pd.concat([bins_transformed, random_data_pd], axis = 1)
When subsetting on a bin, e.g. (5.086, 5.586], it returns all False. Why doesn't this subset?
df.bins == '(5.086, 5.586]' #returns all false.
Answer 0 (score: 1)
If I understand correctly, the reason is that you are using == between different types: pd.Interval vs. str. Please check my example.
print(type(df.bins[0]))
<class 'pandas._libs.interval.Interval'>
print(df.bins)
print(df.bins == pd.Interval(5.086, 5.586))
0 (1.586, 2.086]
1 (6.086, 6.586]
2 (8.586, 9.086]
3 (7.586, 8.086]
4 (5.086, 5.586]
5 (0.585, 1.086]
6 (4.586, 5.086]
7 (1.086, 1.586]
8 (9.086, 9.586]
9 (4.586, 5.086]
10 (1.586, 2.086]
11 (1.086, 1.586]
12 (2.586, 3.086]
13 (2.586, 3.086]
14 (1.086, 1.586]
15 (8.086, 8.586]
16 (7.086, 7.586]
17 (6.586, 7.086]
18 (8.586, 9.086]
19 (7.586, 8.086]
20 (7.586, 8.086]
21 (0.585, 1.086]
22 (4.586, 5.086]
23 (9.086, 9.586]
24 (8.086, 8.586]
25 (6.586, 7.086]
26 (5.086, 5.586]
27 (6.586, 7.086]
28 (5.086, 5.586]
29 (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
(2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
(9.086, 9.586] < (9.586, 10.086]]
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 False
Name: bins, dtype: bool
...and the subset:
print(df[df.bins == pd.Interval(5.086, 5.586)])
bins values
4 (5.086, 5.586] 5.132422
26 (5.086, 5.586] 5.309666
28 (5.086, 5.586] 5.574920
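To make the type mismatch easier to reproduce, here is a minimal sketch on a small, hand-picked dataset (the `values` and `edges` lists are hypothetical and are not the random sample from the question). It shows the failing string comparison next to three ways that should select the bin: comparing against a `pd.Interval`, comparing after `astype(str)`, and looking the interval up from the categories by a value it contains.

```python
import pandas as pd

# Hypothetical data with round bin edges, chosen so the labels are exact.
values = [0.7, 5.2, 5.4, 9.1]
edges = [0.0, 5.0, 5.5, 10.0]

bins = pd.cut(values, edges, right=True, include_lowest=True)
df = pd.DataFrame({"bins": bins, "values": values})

# A str never equals an Interval category, so this mask is all False.
mask_str_literal = df["bins"] == "(5.0, 5.5]"

# Option 1: compare against an Interval with the same edges
# (pd.Interval is closed on the right by default, like the bins here).
mask_interval = df["bins"] == pd.Interval(5.0, 5.5)

# Option 2: convert the labels to text first, then compare strings.
mask_as_str = df["bins"].astype(str) == "(5.0, 5.5]"

# Option 3: fetch the Interval that contains a given value from the
# categories, then compare against that exact object.
categories = df["bins"].cat.categories
target = categories[categories.get_loc(5.2)]
mask_lookup = df["bins"] == target

print(df[mask_interval])
```

Option 3 may be the most robust when the edges are not round numbers, since it reuses the interval stored in the categories instead of retyping the edges from the printed label.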