Creating dummy variables with Pandas in Python for a dataset with continuous and categorical variables

Time: 2020-08-24 17:43:11

Tags: python pandas variables

I am running a logistic regression on employee attrition. The variables I have are:

q3attendance           object
q4attendance           object
average_attendance     object
training_days           int64
esat                 float64
lastcompany           object
client_category       object
qual_category         object
location             object
rating                object
role                  object
band                  object
resourcegroup         object
skill                 object
status                 int64

I have marked the categorical variables using

cat = ['bandlevel', 'resourcegroup', 'skill', ...]

I define x and y with x = df.iloc[:, :-1] and y = df.iloc[:, -1].

Next, I need to create dummy variables, so I use the command

xd = pd.get_dummies(x, drop_first=True)

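After this, I expected the continuous variables to stay unchanged and dummy variables to be created only for the categorical variables. However, when I run the command, the code treats the continuous variables as categorical too and ends up creating dummy variables for every variable. So for tenure values such as 3 years 2 months and 4 years 3 months, 3.2 and 4.3 are each treated as separate categories. I end up with more than 1,500 dummy variables, and running a regression after that is a challenge.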

What am I missing? Should I explicitly mark the categorical variables when using get_dummies?

1 answer:

Answer 0: (score: 0)

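pd.get_dummies has an optional parameter, columns, which accepts a list of the columns you want to encode. When columns is not given, every column with object, string, or category dtype is encoded.

For example: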

df.head()
+----+------+--------------+-------------+---------------------------+----------+-----------------+
|    |   id | first_name   | last_name   | email                     | gender   | ip_address      |
|----+------+--------------+-------------+---------------------------+----------+-----------------|
|  0 |    1 | Lucine       | Krout       | lkrout0@sourceforge.net   | Female   | 199.158.46.27   |
|  1 |    2 | Sherm        | Jullian     | sjullian1@mapy.cz         | Male     | 8.97.22.209     |
|  2 |    3 | Derk         | Mulloch     | dmulloch2@china.com.cn    | Male     | 132.108.184.131 |
|  3 |    4 | Elly         | Sulley      | esulley3@com.com          | Female   | 63.177.149.251  |
|  4 |    5 | Brocky       | Jell        | bjell4@huffingtonpost.com | Male     | 152.32.40.4     |
|  5 |    6 | Harv         | Allot       | hallot5@blogtalkradio.com | Male     | 71.135.240.164  |
|  6 |    7 | Wolfie       | Stable      | wstable6@utexas.edu       | Male     | 211.31.189.141  |
|  7 |    8 | Harcourt     | Dunguy      | hdunguy7@whitehouse.gov   | Male     | 224.214.43.40   |
|  8 |    9 | Devina       | Salerg      | dsalerg8@furl.net         | Female   | 49.169.34.38    |
|  9 |   10 | Missie       | Korpal      | mkorpal9@wunderground.com | Female   | 119.115.90.232  |
+----+------+--------------+-------------+---------------------------+----------+-----------------+

Then

columns = ["gender"]
pd.get_dummies(df, columns=columns)


+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
|    |   id | first_name   | last_name   | email                     | ip_address      |   gender_Female |   gender_Male |
|----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------|
|  0 |    1 | Lucine       | Krout       | lkrout0@sourceforge.net   | 199.158.46.27   |               1 |             0 |
|  1 |    2 | Sherm        | Jullian     | sjullian1@mapy.cz         | 8.97.22.209     |               0 |             1 |
|  2 |    3 | Derk         | Mulloch     | dmulloch2@china.com.cn    | 132.108.184.131 |               0 |             1 |
|  3 |    4 | Elly         | Sulley      | esulley3@com.com          | 63.177.149.251  |               1 |             0 |
|  4 |    5 | Brocky       | Jell        | bjell4@huffingtonpost.com | 152.32.40.4     |               0 |             1 |
|  5 |    6 | Harv         | Allot       | hallot5@blogtalkradio.com | 71.135.240.164  |               0 |             1 |
|  6 |    7 | Wolfie       | Stable      | wstable6@utexas.edu       | 211.31.189.141  |               0 |             1 |
|  7 |    8 | Harcourt     | Dunguy      | hdunguy7@whitehouse.gov   | 224.214.43.40   |               0 |             1 |
|  8 |    9 | Devina       | Salerg      | dsalerg8@furl.net         | 49.169.34.38    |               1 |             0 |
|  9 |   10 | Missie       | Korpal      | mkorpal9@wunderground.com | 119.115.90.232  |               1 |             0 |
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
print(tabulate())

This encodes only the gender column.
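Applied to the situation in the question, the same idea might look like the sketch below. It is only an illustration under assumptions: the toy data and the tenure column are hypothetical (the question mentions tenure but does not list it), and it assumes the continuous values were read in as text, which would give them object dtype like the attendance columns in the listing. Converting such columns with pd.to_numeric and passing the categorical list through columns keeps the continuous variables intact.

```python
import pandas as pd

# Hypothetical toy data: column names echo the question, values are invented.
df = pd.DataFrame({
    "tenure": ["3.2", "4.3", "1.0", "3.2"],  # numbers stored as text (object dtype)
    "training_days": [5, 3, 8, 2],
    "band": ["A", "B", "A", "C"],
    "skill": ["java", "sql", "java", "python"],
    "status": [0, 1, 0, 1],                  # target, last column
})

# Same split as in the question: all but the last column are features.
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Only these columns should become dummy variables.
cat = ["band", "skill"]

# Turn numeric-looking text into real numbers; unparseable values become NaN.
x = x.assign(tenure=pd.to_numeric(x["tenure"], errors="coerce"))

# Encode only the listed columns; drop_first removes one level per variable.
xd = pd.get_dummies(x, columns=cat, drop_first=True)

print(xd.dtypes)
```

Checking x.dtypes before encoding is a quick way to spot continuous columns that were loaded as object.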

All data is auto-generated and does not represent the real world.