Categorizing data into n categories of equal interval size in Python

Time: 2016-04-24 00:14:40

Tags: python pandas statsmodels

Suppose I want to categorize the following data into 12 categories:

   no.     grades
    0      9.08
    1      8.31
    2      7.42
    3      7.42
    4      7.42
    5      7.46
    6      9.67
    7     11.77
    8      8.81
    9      6.44
    10     9.40
    11     9.06
    12    10.52
    13     6.19
    14     5.04
    15     5.04
    16     9.44
    17     5.87
    18     2.67
    19     6.99
    20     9.08
    21     6.64
    22     4.83
    23     4.47
    24     6.61
    25     6.61
    26     7.42
    27     6.42
    28    10.00
    29     9.11

One way to do it is:

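# Element-wise conditions need & with parentheses; `and` raises ValueError on a Series
df.loc[(df.a > 0) & (df.a <= 1), 'a'] = 1
df.loc[(df.a > 1) & (df.a <= 2), 'a'] = 2
.
.
.
df.loc[(df.a > 11) & (df.a <= 12), 'a'] = 12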

Is there another way to categorize items into categories with constant, equal intervals?

P.S.:

My data is below; I want to categorize its grades column:

     psechoice  hscath  grades  faminc  famsiz  parcoll  female  black
0            1       0    9.08   62.50       5        0       0      0
1            1       0    8.31   42.50       4        0       1      0
2            1       0    7.42   62.50       4        0       1      0
3            1       0    7.42   62.50       4        0       1      0
4            1       0    7.42   62.50       4        0       1      0
5            1       0    7.46   12.50       2        0       1      0
6            1       0    9.67   30.00       5        0       0      0
7            0       0   11.77   42.50       4        0       0      0
8            1       0    8.81   17.50       3        0       1      0
9            1       0    6.44   42.50       6        0       0      0
10           1       0    9.40   30.00       5        1       0      0
11           1       0    9.06   62.50       6        0       0      0
12           0       0   10.52   62.50       3        0       0      0
13           1       0    6.19   62.50       2        0       1      0
14           1       0    5.04   42.50       6        0       1      0
15           1       0    5.04   42.50       6        0       1      0
16           0       0    9.44   22.50       2        0       1      0
17           1       1    5.87   87.50       5        1       0      0
18           1       1    2.67   62.50       4        0       0      0
19           1       1    6.99   42.50       5        0       0      0
20           1       1    9.08  150.00       4        1       1      0
21           1       0    6.64   42.50       9        0       1      0
22           1       1    4.83    0.50       4        1       0      0
23           1       1    4.47   62.50       3        0       1      0
24           1       1    6.61   87.50       6        1       0      0
25           1       1    6.61   87.50       6        1       0      0
26           1       1    7.42   42.50       4        1       0      0
27           1       1    6.42   87.50       5        1       0      0
28           1       0   10.00    8.75       4        1       1      0
29           1       0    9.11   22.50       3        0       0      1

1 answer:

Answer 0 (score: 5)

You can use pd.cut to assign values to categories:

import pandas as pd
df = pd.DataFrame(
    {'grades': [9.08, 8.31, 7.42, 7.42, 7.42, 7.46, 9.67, 11.77,
                8.81, 6.44, 9.40, 9.06, 10.52, 6.19, 5.04, 5.04, 9.44, 5.87,
                2.67, 6.99, 9.08, 6.64, 4.83, 4.47, 6.61, 6.61, 7.42, 6.42,
                10.0, 9.11],
     'no.': range(30)})
df['category'] = pd.cut(df['grades'], bins=range(0, 13), labels=range(1, 13))
print(df)

yields

     grades  no. category
0     9.08    0       10
1     8.31    1        9
2     7.42    2        8
3     7.42    3        8
4     7.42    4        8
5     7.46    5        8
6     9.67    6       10
7    11.77    7       12
...
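
The same call can be generalized to any number n of equal-width categories. The sketch below is only an illustration: the helper name equal_width_categories and its lo/hi parameters are assumptions, not part of the original answer. It builds the n + 1 bin edges with np.linspace and passes them to pd.cut.

```python
import numpy as np
import pandas as pd


def equal_width_categories(values, n, lo, hi):
    """Hypothetical helper: label each value 1..n using n equal-width bins over (lo, hi]."""
    edges = np.linspace(lo, hi, n + 1)  # n + 1 edges delimit n intervals
    return pd.cut(values, bins=edges, labels=list(range(1, n + 1)))


sample_grades = pd.Series([9.08, 8.31, 7.42, 11.77, 2.67])
sample_categories = equal_width_categories(sample_grades, n=12, lo=0, hi=12)
print(sample_categories.tolist())  # [10, 9, 8, 12, 3]
```

If the edges do not need to be fixed in advance, pd.cut also accepts an integer for bins, in which case it derives equal-width bins from the minimum and maximum of the data.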

With pd.cut(..., bins=range(0, 13)), the categories are

[(0, 1] < (1, 2] < (2, 3] < (3, 4] ... (8, 9] < (9, 10] < (10, 11] < (11, 12]]

Note that the intervals are open on the left and closed on the right.
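
To see which edge each interval includes, here is a minimal sketch comparing the default behaviour of pd.cut with right=False. The sample values and the labels 1 to 3 are made up for illustration.

```python
import pandas as pd

edge_values = pd.Series([1.0, 1.5, 2.0])

# Default (right=True): intervals are (0, 1], (1, 2], (2, 3],
# so a value exactly on an edge belongs to the lower interval.
right_closed = pd.cut(edge_values, bins=range(0, 4), labels=[1, 2, 3])

# right=False: intervals are [0, 1), [1, 2), [2, 3),
# so a value exactly on an edge belongs to the upper interval.
left_closed = pd.cut(edge_values, bins=range(0, 4), labels=[1, 2, 3], right=False)

print(right_closed.tolist())  # [1, 2, 2]
print(left_closed.tolist())   # [2, 2, 3]
```

With right=False, a value equal to the last bin edge falls outside every interval and is assigned NaN, so the upper edge may need to be extended in that case.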