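I am trying to convert a categorical variable into quantitative variables. I am using the get_dummies function, which should return quantitative (0/1 indicator) variables.
My idea is to create new columns in the DataFrame and put the returned indicator variables into them, but when I print those columns, the output shows something else.
My code: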
import pandas as pd
import numpy as np
df = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/duplicate.csv')
df['0_to_35'],df['35_to_55'],df['greater then 55'] = pd.get_dummies(df['age_band'])
print(df['0_to_35'],df['35_to_55'],df['greater then 55'])
Output:
(0 0-35
1 0-35
2 0-35
3 0-35
4 0-35
5 0-35
6 0-35
7 0-35
8 0-35
9 0-35
10 0-35
11 0-35
12 0-35
13 0-35
14 0-35
15 0-35
16 0-35
17 0-35
18 0-35
19 0-35
20 0-35
21 0-35
22 0-35
23 0-35
24 0-35
25 0-35
26 0-35
27 0-35
28 0-35
29 0-35
...
28755 0-35
28756 0-35
28757 0-35
28758 0-35
28759 0-35
28760 0-35
28761 0-35
28762 0-35
28763 0-35
28764 0-35
28765 0-35
28766 0-35
28767 0-35
28768 0-35
28769 0-35
28770 0-35
28771 0-35
28772 0-35
28773 0-35
28774 0-35
28775 0-35
28776 0-35
28777 0-35
28778 0-35
28779 0-35
28780 0-35
28781 0-35
28782 0-35
28783 0-35
28784 0-35
Name: 0_to_35, dtype: object, 0 35-55
1 35-55
2 35-55
3 35-55
4 35-55
5 35-55
6 35-55
7 35-55
8 35-55
9 35-55
10 35-55
11 35-55
12 35-55
13 35-55
14 35-55
15 35-55
16 35-55
17 35-55
18 35-55
19 35-55
20 35-55
21 35-55
22 35-55
23 35-55
24 35-55
25 35-55
26 35-55
27 35-55
28 35-55
29 35-55
...
28755 35-55
28756 35-55
28757 35-55
28758 35-55
28759 35-55
28760 35-55
28761 35-55
28762 35-55
28763 35-55
28764 35-55
28765 35-55
28766 35-55
28767 35-55
28768 35-55
28769 35-55
28770 35-55
28771 35-55
28772 35-55
28773 35-55
28774 35-55
28775 35-55
28776 35-55
28777 35-55
28778 35-55
28779 35-55
28780 35-55
28781 35-55
28782 35-55
28783 35-55
28784 35-55
Name: 35_to_55, dtype: object, 0 55<=
1 55<=
2 55<=
3 55<=
4 55<=
5 55<=
6 55<=
7 55<=
8 55<=
9 55<=
10 55<=
11 55<=
12 55<=
13 55<=
14 55<=
15 55<=
16 55<=
17 55<=
18 55<=
19 55<=
20 55<=
21 55<=
22 55<=
23 55<=
24 55<=
25 55<=
26 55<=
27 55<=
28 55<=
29 55<=
...
28755 55<=
28756 55<=
28757 55<=
28758 55<=
28759 55<=
28760 55<=
28761 55<=
28762 55<=
28763 55<=
28764 55<=
28765 55<=
28766 55<=
28767 55<=
28768 55<=
28769 55<=
28770 55<=
28771 55<=
28772 55<=
28773 55<=
28774 55<=
28775 55<=
28776 55<=
28777 55<=
28778 55<=
28779 55<=
28780 55<=
28781 55<=
28782 55<=
28783 55<=
28784 55<=
Name: greater then 55, dtype: object)
Output of pd.get_dummies(df['age_band']):
0-35 35-55 55<=
0 0 0 1
1 0 1 0
2 0 1 0
3 0 1 0
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 1 0 0
9 0 0 1
10 0 1 0
11 1 0 0
12 0 1 0
13 1 0 0
14 0 1 0
15 1 0 0
16 0 1 0
17 0 1 0
18 0 1 0
19 0 1 0
20 1 0 0
21 1 0 0
22 0 1 0
23 0 1 0
24 1 0 0
25 0 1 0
26 1 0 0
27 1 0 0
28 0 1 0
29 0 1 0
... ... ... ...
28755 0 1 0
28756 0 1 0
28757 1 0 0
28758 0 1 0
28759 0 1 0
28760 0 1 0
28761 0 1 0
28762 0 1 0
28763 0 1 0
28764 0 1 0
28765 0 1 0
28766 0 1 0
28767 0 1 0
28768 0 1 0
28769 1 0 0
28770 0 1 0
28771 0 1 0
28772 0 1 0
28773 1 0 0
28774 0 1 0
28775 1 0 0
28776 1 0 0
28777 1 0 0
28778 0 1 0
28779 1 0 0
28780 1 0 0
28781 0 1 0
28782 1 0 0
28783 0 1 0
28784 0 1 0
[28785 rows x 3 columns]
[Finished in 0.216s]
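I don't understand why this happens. The three indicator variables shown above should end up in the new columns. How can I fix this?
Answer 0 (score: 1)
Unpacking a DataFrame into three targets iterates over its column labels, so each new column was filled with a label string ('0-35', '35-55', '55<=') instead of the indicator values. I think you need to assign to a list of the new column names instead: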
df[['0_to_35', '35_to_55', 'greater then 55']] = pd.get_dummies(df['age_band'])
Or assign the result to a new DataFrame and join it:
df1 = pd.get_dummies(df['age_band'])
# set new column names if necessary
df1.columns = ['0_to_35','35_to_55','greater then 55']
df = df.join(df1)
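
The behaviour in the question can be reproduced on a few rows, which may make the join approach easier to check. This is only a minimal sketch: the four-row DataFrame is made up to stand in for duplicate.csv, and the names dummies, labels and result are illustrative. The astype(int) call is an addition that is not in the answer above; it is there because newer pandas versions return boolean indicators rather than 0/1 integers.

```python
import pandas as pd

# Hypothetical stand-in for the question's CSV: a tiny frame with an 'age_band' column.
df = pd.DataFrame({'age_band': ['55<=', '35-55', '0-35', '35-55']})

dummies = pd.get_dummies(df['age_band'])

# Unpacking a DataFrame iterates over its column labels, not its columns,
# which is why the original assignment stored label strings.
a, b, c = dummies
labels = [a, b, c]
print(labels)  # ['0-35', '35-55', '55<=']

# Fix: rename the indicator columns, then join them back on the shared index.
dummies.columns = ['0_to_35', '35_to_55', 'greater then 55']
result = df.join(dummies.astype(int))  # astype(int) is an assumption, see above
print(result)
```

Because join aligns on the index, this relies on df and dummies sharing the same index, which holds here since dummies was built directly from a column of df.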