Question

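I am working with two pandas data frames. I am trying to assign each age to an age group (for example, a 24-year-old would fall into age group 3 in the table below). However, I keep getting errors about different lengths.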

Here is an expanded example of what I am attempting:

 Actual Data (df)
 Age
 23
 34
 35
 45
 67

 Age Group Definition Table (defT)
 AgeGrp  MinAge  MaxAge
 1       14      18
 2       19      21
 3       22      24
 4       25      34
 5       35      44  
 6       45      54
 7       55      65

 FINAL Data Frame
 Age    AgeGrp
 23     3
 34     4
 35     5
 45     6
 65     7

Here is the code I tried:

df['AgeGrp'] = 0
for index, row in data.iterrows():
    if ((data.Age >= defT.MinAge) & (data.Age <= defT.MaxAge)):
        data.AgeGrp = defT.AgeGrp

Thank you in advance; this process will be run on 300,000 rows.

Answer 1

You can use pd.cut(). For 300,000 rows it takes under 10 ms (see the timing below).

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randint(14, 65, 300000), columns=['Age'])

Out[74]: 
        Age
0        58
1        61
2        14
3        17
4        17
5        53
6        23
7        33
8        35
9        64
10       50
11       37
12       20
13       38
14       38
...     ...
299985   34
299986   50
299987   36
299988   35
299989   53
299990   18
299991   34
299992   32
299993   42
299994   24
299995   28
299996   16
299997   33
299998   19
299999   52

[300000 rows x 1 columns]

defT = pd.DataFrame(dict(AgeGrp=[1,2,3,4,5,6,7], MinAge=[14,19,22,25,35,45,55], MaxAge=[18,21,24,34,44,54,65]))

Out[76]: 
   AgeGrp  MaxAge  MinAge
0       1      18      14
1       2      21      19
2       3      24      22
3       4      34      25
4       5      44      35
5       6      54      45
6       7      65      55

# the age cut-off points
cutoff = np.hstack([np.array(defT.MinAge[0]), defT.MaxAge.values])
labels = defT.AgeGrp.values
# use pd.cut
df['Age_Group'] = pd.cut(df.Age, bins=cutoff, labels=labels, right=True, include_lowest=True)

%time df['Age_Group'] = pd.cut(df.Age, bins=cutoff, labels=labels, right=True, include_lowest=True)

CPU times: user 9.33 ms, sys: 111 µs, total: 9.44 ms
Wall time: 9.36 ms
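
To make the boundary handling concrete, here is a small sketch of the same pd.cut call on a handful of edge ages. The cut-offs are written out literally (they are the values the hstack line above produces for this defT), and the sample ages and the name `edge_ages` are only illustrative. With right=True each bin is closed on the right, and include_lowest=True also pulls the lowest edge (14) into the first bin.

```python
import numpy as np
import pandas as pd

# Cut-offs as produced from defT: the first MinAge followed by every MaxAge
cutoff = np.array([14, 18, 21, 24, 34, 44, 54, 65])
labels = [1, 2, 3, 4, 5, 6, 7]

# Illustrative ages sitting on or next to the bin edges
edge_ages = pd.Series([14, 18, 19, 21, 22, 65, 66])

groups = pd.cut(edge_ages, bins=cutoff, labels=labels,
                right=True, include_lowest=True)

# 14 and 18 land in group 1, 19 and 21 in group 2, 22 in group 3,
# 65 in group 7; 66 is above the last edge and becomes NaN
print(groups.tolist())
```

Because the bins are contiguous, this matches the MinAge/MaxAge table only for whole-number ages; a fractional age such as 18.5 would be placed in group 2 even though the table leaves a gap between 18 and 19.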


Out[78]: 
        Age Age_Group
0        58         7
1        61         7
2        14         1
3        17         1
4        17         1
5        53         6
6        23         3
7        33         4
8        35         5
9        64         7
10       50         6
11       37         5
12       20         2
13       38         5
14       38         5
...     ...       ...
299985   34         4
299986   50         6
299987   36         5
299988   35         5
299989   53         6
299990   18         1
299991   34         4
299992   32         4
299993   42         5
299994   24         3
299995   28         4
299996   16         1
299997   33         4
299998   19         2
299999   52         6

[300000 rows x 2 columns]
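
As a possible alternative (not part of the approach above), the lookup can also be driven directly by both MinAge and MaxAge through a pd.IntervalIndex, which avoids assuming that the groups are contiguous. The following is a minimal sketch on the five ages from the question; the variable names `intervals`, `pos` and `age_grp` are my own. Note that 67 lies above the highest MaxAge (65), so it gets no group.

```python
import numpy as np
import pandas as pd

# Ages from the question's "Actual Data" table
df = pd.DataFrame({'Age': [23, 34, 35, 45, 67]})

defT = pd.DataFrame({
    'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
    'MinAge': [14, 19, 22, 25, 35, 45, 55],
    'MaxAge': [18, 21, 24, 34, 44, 54, 65],
})

# One closed interval [MinAge, MaxAge] per age group
intervals = pd.IntervalIndex.from_arrays(defT['MinAge'], defT['MaxAge'],
                                         closed='both')

# Position of the matching interval for each age, -1 where none matches
pos = intervals.get_indexer(df['Age'])

# Translate positions to group labels; unmatched ages become NaN
age_grp = defT['AgeGrp'].to_numpy()[pos].astype(float)
age_grp[pos == -1] = np.nan
df['AgeGrp'] = age_grp

print(df)
```

The resulting column is float because of the NaN for the unmatched age. I have not timed this variant on 300,000 rows, so the pd.cut timing above should not be assumed to carry over.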

Python Pandas conditional statement with two data frames of different lengths