I am running a logistic regression on employee attrition. The variables I have are:
q3attendance object
q4attendance object
average_attendance object
training_days int64
esat float64
lastcompany object
client _category object
qual_category object
location object
rating object
role object
band object
resourcegroup object
skill object
status int64
I have flagged the categorical variables with
cat = ['bandlevel', 'resourcegroup', 'skill', ..]
I define x and y with
x = df.iloc[:, :-1]
y = df.iloc[:, -1]
Next, I need to create dummy variables, so I run
xd = pd.get_dummies(x, drop_first=True)
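I expected the continuous variables to stay as they are and dummy variables to be created only for the categorical ones. But when I ran the command, the continuous variables were treated as categorical too, and dummies were created for every variable. So with tenures of 3 years 2 months, 4 years 3 months, and so on, 3.2 and 4.3 were each treated as a category. I ended up with more than 1,500 dummy variables, and running a regression after that is a challenge.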
What am I missing? Should I explicitly flag the categorical variables when calling get_dummies?
Answer 0 (score: 0)
pd.get_dummies has an optional parameter, columns, which takes the list of columns you want encoded.
For example:
df.head()
+----+------+--------------+-------------+---------------------------+----------+-----------------+
| | id | first_name | last_name | email | gender | ip_address |
|----+------+--------------+-------------+---------------------------+----------+-----------------|
| 0 | 1 | Lucine | Krout | lkrout0@sourceforge.net | Female | 199.158.46.27 |
| 1 | 2 | Sherm | Jullian | sjullian1@mapy.cz | Male | 8.97.22.209 |
| 2 | 3 | Derk | Mulloch | dmulloch2@china.com.cn | Male | 132.108.184.131 |
| 3 | 4 | Elly | Sulley | esulley3@com.com | Female | 63.177.149.251 |
| 4 | 5 | Brocky | Jell | bjell4@huffingtonpost.com | Male | 152.32.40.4 |
| 5 | 6 | Harv | Allot | hallot5@blogtalkradio.com | Male | 71.135.240.164 |
| 6 | 7 | Wolfie | Stable | wstable6@utexas.edu | Male | 211.31.189.141 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7@whitehouse.gov | Male | 224.214.43.40 |
| 8 | 9 | Devina | Salerg | dsalerg8@furl.net | Female | 49.169.34.38 |
| 9 | 10 | Missie | Korpal | mkorpal9@wunderground.com | Female | 119.115.90.232 |
+----+------+--------------+-------------+---------------------------+----------+-----------------+
Then
columns = ["gender"]
pd.get_dummies(df, columns=columns)
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
| | id | first_name | last_name | email | ip_address | gender_Female | gender_Male |
|----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------|
| 0 | 1 | Lucine | Krout | lkrout0@sourceforge.net | 199.158.46.27 | 1 | 0 |
| 1 | 2 | Sherm | Jullian | sjullian1@mapy.cz | 8.97.22.209 | 0 | 1 |
| 2 | 3 | Derk | Mulloch | dmulloch2@china.com.cn | 132.108.184.131 | 0 | 1 |
| 3 | 4 | Elly | Sulley | esulley3@com.com | 63.177.149.251 | 1 | 0 |
| 4 | 5 | Brocky | Jell | bjell4@huffingtonpost.com | 152.32.40.4 | 0 | 1 |
| 5 | 6 | Harv | Allot | hallot5@blogtalkradio.com | 71.135.240.164 | 0 | 1 |
| 6 | 7 | Wolfie | Stable | wstable6@utexas.edu | 211.31.189.141 | 0 | 1 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7@whitehouse.gov | 224.214.43.40 | 0 | 1 |
| 8 | 9 | Devina | Salerg | dsalerg8@furl.net | 49.169.34.38 | 1 | 0 |
| 9 | 10 | Missie | Korpal | mkorpal9@wunderground.com | 119.115.90.232 | 1 | 0 |
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
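This encodes only the gender column and leaves every other column unchanged. The tables were printed with print(tabulate()).
All data is auto-generated and does not represent the real world.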
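Applied to the question, a rough sketch might look like the following. When columns is not given, get_dummies encodes every column whose dtype is object, string, or category, and the dtype listing in the question shows q3attendance, q4attendance, and average_attendance as object, which would explain why they were expanded into dummies. The sample values below are made up, the frame is a hypothetical miniature of the asker's data, and the assumption that the attendance values are numbers stored as strings is mine; check your own data before converting.

```python
import pandas as pd

# Hypothetical miniature of the asker's frame; all values are invented.
df = pd.DataFrame({
    # Assumed: numeric values stored as strings, hence object dtype.
    "average_attendance": ["91.5", "88.0", "79.25", "95.0"],
    "training_days": [3, 5, 2, 4],
    "esat": [3.2, 4.3, 3.9, 4.1],
    "band": ["A", "B", "A", "C"],
    "skill": ["java", "sql", "java", "python"],
    "status": [0, 1, 0, 1],
})

x = df.iloc[:, :-1].copy()   # predictors: everything except the last column
y = df.iloc[:, -1]           # target: the last column (status)

# Columns that really are categorical; everything else is treated as numeric.
cat = ["band", "skill"]
num = [c for c in x.columns if c not in cat]

# Convert numeric-looking columns to real numbers.
# errors="coerce" turns values that cannot be parsed into NaN.
x[num] = x[num].apply(pd.to_numeric, errors="coerce")

# Encode only the listed categorical columns; numeric columns pass through.
xd = pd.get_dummies(x, columns=cat, drop_first=True)

print(xd.dtypes)
```

Passing columns=cat limits the encoding to the listed columns even if a numeric column is still stored as object, but such a column would then reach the regression as strings, so converting it first is the safer order. Values that pd.to_numeric cannot parse become NaN with errors="coerce" and need handling before fitting.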