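I am working with two pandas DataFrames. I am trying to assign each age to an age group (for example, age 24 would be age group 3). However, I keep getting an error about different lengths.
Here is an expanded example of what I am trying to do: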
Actual Data (df)
Age
23
34
35
45
67
Age Group Definition Table (defT)
AgeGrp MinAge MaxAge
1 14 18
2 19 21
3 22 24
4 25 34
5 35 44
6 45 54
7 55 65
FINAL Data Frame
Age AgeGrp
23 3
34 4
35 5
45 6
65 7
Here is the code I tried:
df['AgeGrp'] = 0
for index, row in data.iterrows():
    if ((data.Age >= defT.MinAge) & (data.Age <= defT.MaxAge)):
        data.AgeGrp = defT.AgeGrp
Thanks in advance; this process will be applied to 300,000 rows.
Answer 0 (score: 1)
You can use pd.cut(). For 300,000 rows, it takes less than 10 milliseconds.
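For context on the error in the question: the loop compares the whole Age column against the whole MinAge column on every iteration, and pandas refuses to compare two Series whose indexes do not match. The following is a minimal sketch of that failure, using the sample values from the question; the variable names `ages`, `min_age`, and `error_message` are only illustrative.

```python
import pandas as pd

# Sample values from the question: 5 ages versus 7 group lower bounds.
ages = pd.Series([23, 34, 35, 45, 67], name="Age")
min_age = pd.Series([14, 19, 22, 25, 35, 45, 55], name="MinAge")

try:
    # Element-wise comparison of two Series with different indexes.
    ages >= min_age
    error_message = None
except ValueError as exc:
    # pandas raises ValueError because the two Series are not aligned.
    error_message = str(exc)

print(error_message)
```

Even with equal lengths, placing the resulting boolean Series directly inside an `if` would raise another ValueError, because the truth value of a Series is ambiguous. A vectorized approach such as pd.cut avoids both problems.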
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randint(14, 65, 300000), columns=['Age'])
Out[74]:
Age
0 58
1 61
2 14
3 17
4 17
5 53
6 23
7 33
8 35
9 64
10 50
11 37
12 20
13 38
14 38
... ...
299985 34
299986 50
299987 36
299988 35
299989 53
299990 18
299991 34
299992 32
299993 42
299994 24
299995 28
299996 16
299997 33
299998 19
299999 52
[300000 rows x 1 columns]
defT = pd.DataFrame(dict(AgeGrp=[1,2,3,4,5,6,7], MinAge=[14,19,22,25,35,45,55], MaxAge=[18,21,24,34,44,54,65]))
Out[76]:
AgeGrp MaxAge MinAge
0 1 18 14
1 2 21 19
2 3 24 22
3 4 34 25
4 5 44 35
5 6 54 45
6 7 65 55
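# the age cut-off points: the lowest MinAge followed by every MaxAge
# (this assumes integer ages and contiguous groups with no gaps between them)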
cutoff = np.hstack([np.array(defT.MinAge[0]), defT.MaxAge.values])
labels = defT.AgeGrp.values
# use pd.cut
df['Age_Group'] = pd.cut(df.Age, bins=cutoff, labels=labels, right=True, include_lowest=True)
%time df['Age_Group'] = pd.cut(df.Age, bins=cutoff, labels=labels, right=True, include_lowest=True)
CPU times: user 9.33 ms, sys: 111 µs, total: 9.44 ms
Wall time: 9.36 ms
Out[78]:
Age Age_Group
0 58 7
1 61 7
2 14 1
3 17 1
4 17 1
5 53 6
6 23 3
7 33 4
8 35 5
9 64 7
10 50 6
11 37 5
12 20 2
13 38 5
14 38 5
... ... ...
299985 34 4
299986 50 6
299987 36 5
299988 35 5
299989 53 6
299990 18 1
299991 34 4
299992 32 4
299993 42 5
299994 24 3
299995 28 4
299996 16 1
299997 33 4
299998 19 2
299999 52 6
[300000 rows x 2 columns]
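As a quick check, the same approach can be applied to the five sample ages from the question. This is a minimal sketch rather than a definitive implementation; the names `sample` and `bins` are illustrative, and it assumes integer ages and contiguous groups, as in the definition table. Note that 67 lies above the highest MaxAge (65), so pd.cut leaves it unassigned (NaN).

```python
import pandas as pd

# Sample ages from the question.
sample = pd.DataFrame({"Age": [23, 34, 35, 45, 67]})

# Age group definition table from the question.
defT = pd.DataFrame({
    "AgeGrp": [1, 2, 3, 4, 5, 6, 7],
    "MinAge": [14, 19, 22, 25, 35, 45, 55],
    "MaxAge": [18, 21, 24, 34, 44, 54, 65],
})

# Bin edges: the lowest MinAge followed by every MaxAge.
bins = [defT["MinAge"].iloc[0], *defT["MaxAge"]]

# right=True makes each bin include its upper edge;
# include_lowest=True makes the first bin include age 14 as well.
sample["AgeGrp"] = pd.cut(
    sample["Age"],
    bins=bins,
    labels=defT["AgeGrp"].to_numpy(),
    right=True,
    include_lowest=True,
)

print(sample)
```

Ages outside the defined range come back as NaN, so it may be worth checking `sample["AgeGrp"].isna()` before using the result.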