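Does anyone know how I can better clean this data so that I can run a logistic regression on it?
I am trying to one-hot encode several categorical variables such as race and workclass (as in the sample dataset below), but I am not sure how to do this.
I plan to recode income as 1 and 0, since it has only 2 categories, but I cannot do the same for the other variables.
My current plan is to run a logistic regression on all of the listed variables: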
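# Assuming the CSV headers are hyphenated (e.g. hours-per-week): read.csv()
# converts hyphens in column names to dots, so the formula uses hours.per.week etc.
data <- read.csv("adult_income.csv", stringsAsFactors = TRUE)
mylogit <- glm(formula = income ~ age + workclass + educational.num +
                 marital.status + occupation + race + gender +
                 capital.gain + capital.loss + hours.per.week +
                 native.country, data = data, family = "binomial")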
I am still very new to R, so apologies for any rookie mistakes!
Answer 0 (score: 1)
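R handles this well: when you wrap a variable in as.factor(), R encodes the categorical variable internally, so you do not need to build the dummy columns yourself. This has already been answered in the question "categorical variable in logistic regression in r".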
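To make the internal encoding visible, here is a minimal sketch using only base R. The data frame `toy` and its values are made up for illustration and are not the asker's file. `model.matrix()` shows the dummy columns R builds for a factor (one per non-reference level), and `glm()` applies the same encoding when fitting.

```r
set.seed(1)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  income = sample(c("<=50K", ">50K"), 200, replace = TRUE),
  age    = sample(18:85, 200, replace = TRUE),
  race   = sample(c("A", "B", "C"), 200, replace = TRUE)
)

# Inspect the design matrix: the first level ("A") is the reference level,
# so only "B" and "C" get their own 0/1 column
X <- model.matrix(~ age + as.factor(race), data = toy)
colnames(X)

# glm() builds the same columns internally; a factor response is accepted
# for the binomial family (the first level is treated as "failure")
fit <- glm(as.factor(income) ~ age + as.factor(race),
           data = toy, family = binomial)
names(coef(fit))
```

The coefficient names match the design-matrix columns, which is why no manual one-hot step is needed before calling `glm()`.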
Answer 1 (score: 1)
Use data.table and mltools:
df <- as.data.table(df)
df_oh <- one_hot(df)
head(df_oh)
age education_level marital_status_Divorced marital_status_Married marital_status_Never marital_status_Widowed occupation_Admin occupation_Banking occupation_Farming occupation_Fishing occupation_Poledancing gender_Man gender_Unicorn gender_Woman hours income_<=50K income_>50K
1: 26 12 0 0 0 1 0 0 0 0 1 0 0 1 39.69357 0 1
2: 70 12 0 0 0 1 0 0 0 0 1 1 0 0 39.35318 0 1
3: 21 14 1 0 0 0 1 0 0 0 0 0 0 1 40.72573 1 0
4: 56 1 0 1 0 0 0 1 0 0 0 1 0 0 39.04525 0 1
5: 81 2 0 0 0 1 0 0 1 0 0 0 1 0 39.21665 1 0
6: 38 5 0 0 0 1 1 0 0 0 0 1 0 0 39.94481 1 0
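one_hot() takes all factor variables of a data table (i.e. not numeric, not character, etc.) and one-hot encodes them. It requires a data table (rather than a data frame) because data tables provide features and concepts that help with flexibility and speed.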
If you look at the documentation under ?one_hot, you will see that the function also handles NA values well (in case that is an issue in your data).
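The point about factor columns matters in practice, because a character column is left untouched. The sketch below illustrates this on a small made-up table (`dt` and its values are hypothetical); it assumes the data.table and mltools packages are installed.

```r
library(data.table)
library(mltools)

# Hypothetical toy table: one character column, one factor column
dt <- data.table(
  gender     = c("Man", "Woman", "Man"),                # character
  occupation = factor(c("Admin", "Banking", "Admin"))   # factor
)

# Only the factor column is expanded; gender stays a single column
before <- one_hot(dt)
names(before)

# Convert the character column to a factor so it is encoded as well
dt[, gender := as.factor(gender)]
after <- one_hot(dt)
names(after)
```

If some columns come in as character (for example from read.csv() on R 4.0 or later), converting them with as.factor() first is one way to have them picked up.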
Feel free to add a comment if you have any questions.
The mock data used above was created as follows:
# Load libraries
library(data.table)
library(mltools)

# Set seed for reproducibility
set.seed(1701)

# Create mock data frame
# stringsAsFactors = TRUE makes the text columns factors on R >= 4.0 as well,
# which is what one_hot() looks for
df <- data.frame(
  age = sample(18:85, 50, replace = TRUE),
  education_level = sample(1:15, 50, replace = TRUE),
  marital_status = sample(c("Never", "Married", "Divorced", "Widowed"), 50, replace = TRUE),
  occupation = sample(c("Admin", "Farming", "Poledancing", "Fishing", "Banking"), 50, replace = TRUE),
  gender = sample(c("Man", "Woman", "Unicorn"), 50, replace = TRUE),
  hours = rnorm(50, 40, 1),
  income = sample(c("<=50K", ">50K"), 50, replace = TRUE),
  stringsAsFactors = TRUE)
This results in:
> head(df)
  age education_level marital_status  occupation  gender    hours income
1  26              12        Widowed Poledancing   Woman 39.69357   >50K
2  70              12        Widowed Poledancing     Man 39.35318   >50K
3  21              14       Divorced       Admin   Woman 40.72573  <=50K
4  56               1        Married     Banking     Man 39.04525   >50K
5  81               2        Widowed     Farming Unicorn 39.21665  <=50K
6  38               5        Widowed       Admin     Man 39.94481  <=50K
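For the logistic regression in the question, the response is easier to handle as a single 0/1 column than as two separate income columns. A minimal base R sketch of that recoding follows; the `income` vector is made up in the same format as the mock data, and the `trimws()` call is only a precaution on the assumption that real labels may carry stray spaces.

```r
# Hypothetical income labels in the same format as the mock data
income <- c(">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K")

# The comparison yields TRUE/FALSE, which as.integer() turns into 1/0;
# trimws() guards against leading or trailing spaces in the labels
income01 <- as.integer(trimws(income) == ">50K")
income01

# Cross-check the recoding against the original labels
table(income, income01)
```

Here 1 stands for ">50K" and 0 for "<=50K"; which class gets the 1 is a choice that only affects the sign of the fitted coefficients.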
Answer 2 (score: 0)
Install the dummies library.
Example:
library(dummies)
# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))
This generates dummy variables as follows:
df1
# id year df1_1991 df1_1992 df1_1993 df1_1994
# 1 1 1991 1 0 0 0
# 2 2 1992 0 1 0 0
# 3 3 1993 0 0 1 0
# 4 4 1994 0 0 0 1
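If installing an extra package is not an option, roughly the same columns can be built with base R. This is a sketch of an alternative technique (`model.matrix()`), not what the dummies package does internally, and the `year_` column prefix is chosen here for illustration.

```r
# Same example data as in the answer above
df1 <- data.frame(id = 1:4, year = 1991:1994)

# "- 1" drops the intercept, so every year gets its own 0/1 column
year_dummies <- model.matrix(~ factor(year) - 1, data = df1)

# Replace the default "factor(year)1991"-style names with shorter ones
colnames(year_dummies) <- paste0("year_", levels(factor(df1$year)))

df1 <- cbind(df1, year_dummies)
df1
```

Keeping the intercept instead (omitting `- 1`) would drop the first level as a reference, which is the form a regression model normally uses.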