Converting categorical variables to quantitative variables in Python

Time: 2018-07-11 06:22:47

Tags: python pandas

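I am trying to convert a categorical variable into quantitative variables. I am using the get_dummies function, which should return the quantitative (indicator) variables.

My idea was to create new columns in the DataFrame and store the returned indicator variables in them, but when I print the new columns the output shows something else.

My code: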

    import pandas as pd
    import numpy as np

    df = pd.read_csv('/home/user/Documents/MOOC dataset cleaned/duplicate.csv')
    df['0_to_35'],df['35_to_55'],df['greater then 55'] = pd.get_dummies(df['age_band'])

    print(df['0_to_35'],df['35_to_55'],df['greater then 55'])

Output:

(0       0-35
1        0-35
2        0-35
3        0-35
4        0-35
5        0-35
6        0-35
7        0-35
8        0-35
9        0-35
10       0-35
11       0-35
12       0-35
13       0-35
14       0-35
15       0-35
16       0-35
17       0-35
18       0-35
19       0-35
20       0-35
21       0-35
22       0-35
23       0-35
24       0-35
25       0-35
26       0-35
27       0-35
28       0-35
29       0-35
         ... 
28755    0-35
28756    0-35
28757    0-35
28758    0-35
28759    0-35
28760    0-35
28761    0-35
28762    0-35
28763    0-35
28764    0-35
28765    0-35
28766    0-35
28767    0-35
28768    0-35
28769    0-35
28770    0-35
28771    0-35
28772    0-35
28773    0-35
28774    0-35
28775    0-35
28776    0-35
28777    0-35
28778    0-35
28779    0-35
28780    0-35
28781    0-35
28782    0-35
28783    0-35
28784    0-35
Name: 0_to_35, dtype: object, 0        35-55
1        35-55
2        35-55
3        35-55
4        35-55
5        35-55
6        35-55
7        35-55
8        35-55
9        35-55
10       35-55
11       35-55
12       35-55
13       35-55
14       35-55
15       35-55
16       35-55
17       35-55
18       35-55
19       35-55
20       35-55
21       35-55
22       35-55
23       35-55
24       35-55
25       35-55
26       35-55
27       35-55
28       35-55
29       35-55
         ...  
28755    35-55
28756    35-55
28757    35-55
28758    35-55
28759    35-55
28760    35-55
28761    35-55
28762    35-55
28763    35-55
28764    35-55
28765    35-55
28766    35-55
28767    35-55
28768    35-55
28769    35-55
28770    35-55
28771    35-55
28772    35-55
28773    35-55
28774    35-55
28775    35-55
28776    35-55
28777    35-55
28778    35-55
28779    35-55
28780    35-55
28781    35-55
28782    35-55
28783    35-55
28784    35-55
Name: 35_to_55, dtype: object, 0        55<=
1        55<=
2        55<=
3        55<=
4        55<=
5        55<=
6        55<=
7        55<=
8        55<=
9        55<=
10       55<=
11       55<=
12       55<=
13       55<=
14       55<=
15       55<=
16       55<=
17       55<=
18       55<=
19       55<=
20       55<=
21       55<=
22       55<=
23       55<=
24       55<=
25       55<=
26       55<=
27       55<=
28       55<=
29       55<=
         ... 
28755    55<=
28756    55<=
28757    55<=
28758    55<=
28759    55<=
28760    55<=
28761    55<=
28762    55<=
28763    55<=
28764    55<=
28765    55<=
28766    55<=
28767    55<=
28768    55<=
28769    55<=
28770    55<=
28771    55<=
28772    55<=
28773    55<=
28774    55<=
28775    55<=
28776    55<=
28777    55<=
28778    55<=
28779    55<=
28780    55<=
28781    55<=
28782    55<=
28783    55<=
28784    55<=
Name: greater then 55, dtype: object)

Output of pd.get_dummies(df['age_band']):

       0-35  35-55  55<=
0         0      0     1
1         0      1     0
2         0      1     0
3         0      1     0
4         1      0     0
5         0      1     0
6         1      0     0
7         1      0     0
8         1      0     0
9         0      0     1
10        0      1     0
11        1      0     0
12        0      1     0
13        1      0     0
14        0      1     0
15        1      0     0
16        0      1     0
17        0      1     0
18        0      1     0
19        0      1     0
20        1      0     0
21        1      0     0
22        0      1     0
23        0      1     0
24        1      0     0
25        0      1     0
26        1      0     0
27        1      0     0
28        0      1     0
29        0      1     0
...     ...    ...   ...
28755     0      1     0
28756     0      1     0
28757     1      0     0
28758     0      1     0
28759     0      1     0
28760     0      1     0
28761     0      1     0
28762     0      1     0
28763     0      1     0
28764     0      1     0
28765     0      1     0
28766     0      1     0
28767     0      1     0
28768     0      1     0
28769     1      0     0
28770     0      1     0
28771     0      1     0
28772     0      1     0
28773     1      0     0
28774     0      1     0
28775     1      0     0
28776     1      0     0
28777     1      0     0
28778     0      1     0
28779     1      0     0
28780     1      0     0
28781     0      1     0
28782     1      0     0
28783     0      1     0
28784     0      1     0

[28785 rows x 3 columns]
[Finished in 0.216s]

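I don't understand why this happens. The three new columns should contain the indicator values shown above. How can I fix this?

1 answer:

Answer 0 (score: 1)

Tuple-unpacking a DataFrame iterates over its column labels, so each new column was filled with a label string such as '0-35' instead of the indicator values. I think you need to assign to a list of the new column names instead: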

    df[['0_to_35', '35_to_55', 'greater then 55']] = pd.get_dummies(df['age_band'])
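
To see why the original unpacking produced label strings, here is a minimal sketch on a small made-up DataFrame (a stand-in for the asker's CSV, which is not available here). The `dtype=int` argument is an assumption added so the indicators print as 0/1 regardless of pandas version; it is not part of the original code.

```python
import pandas as pd

# Hypothetical sample data; only the 'age_band' column name comes from the question.
df = pd.DataFrame({"age_band": ["55<=", "35-55", "0-35", "35-55"]})

dummies = pd.get_dummies(df["age_band"], dtype=int)

# Iterating over a DataFrame (which is what tuple-unpacking does)
# yields its column labels, not its column data.
a, b, c = dummies
labels = [a, b, c]
print(labels)

# Assigning the whole DataFrame to a list of new column names
# copies the columns positionally, one per name.
new_cols = ["0_to_35", "35_to_55", "greater then 55"]
df[new_cols] = dummies
print(df)
```

Because the copy is positional, this relies on get_dummies returning its columns in sorted label order ('0-35', '35-55', '55<=').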

Or assign the result to a new DataFrame and use join:

    df1 = pd.get_dummies(df['age_band'])
    # set new column names if necessary
    df1.columns = ['0_to_35', '35_to_55', 'greater then 55']
    df = df.join(df1)
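
A variation on the join approach, sketched on made-up data: renaming through an explicit mapping avoids depending on the order of the dummy columns. The `name_map` dictionary and the sample rows are illustrative assumptions, and `dtype=int` is added only to get 0/1 integers on any pandas version.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's CSV.
df = pd.DataFrame({"age_band": ["0-35", "55<=", "35-55"]})

# Illustrative mapping from original category labels to the desired column names.
name_map = {
    "0-35": "0_to_35",
    "35-55": "35_to_55",
    "55<=": "greater then 55",
}

# Build the indicator columns, rename them by label, then join on the index.
df1 = pd.get_dummies(df["age_band"], dtype=int).rename(columns=name_map)
df = df.join(df1)
print(df)
```

Each row ends up with exactly one indicator set to 1, matching its age_band value.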