How do I one-hot encode a dataset with several categorical variables in R?

Time: 2018-10-22 17:38:10

Tags: r analytics data-science one-hot-encoding

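Does anyone know how I can better clean this data so that I can run a logistic regression on it?

I'm trying to one-hot encode several categorical variables such as race and workclass (as shown in the sample dataset below), but I'm not sure how to do it.

I plan to recode income as 1 and 0, since it has only two categories, but I can't do the same for the other variables.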
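A minimal sketch of that 1/0 recoding, assuming the column holds the labels "<=50K" and ">50K" (the vector below is made up for illustration, and the real labels in the CSV may differ):

```r
# Hypothetical toy column; the real labels in adult_income.csv may differ
income <- c("<=50K", ">50K", ">50K", "<=50K")

# 1 for the higher bracket, 0 otherwise
income_binary <- as.integer(income == ">50K")
income_binary
```

glm() with family = "binomial" also accepts a two-level factor as the response, so this step is optional.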

My current plan is to run a logistic regression on all of the listed variables:

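# read.csv() turns hyphens in column names into dots (check.names = TRUE by default),
# so a header such as "capital-gain" is assumed to arrive as capital.gain.
# stringsAsFactors = TRUE makes the text columns factors, which glm() dummy-codes itself.
data <- read.csv("adult_income.csv", stringsAsFactors = TRUE)
mylogit <- glm(formula = income ~ age + workclass + educational.num +
                   marital.status + occupation + race + gender +
                   capital.gain + capital.loss + hours.per.week +
                   native.country, data = data, family = "binomial")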

Sample dataset: 1

I'm still very new to R, so apologies for any rookie mistakes!

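3 answers:

Answer 0: (score: 1)

R handles this well: when you wrap a variable in as.factor(), R encodes the categorical variable internally when fitting the model. This question has already been answered in "categorical variable in logistic regression in r".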
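To see that internal encoding, here is a small base-R sketch. The data frame and its column names are hypothetical and chosen only for illustration; model.matrix() shows the design matrix that glm() builds from a factor:

```r
# Hypothetical toy data; names and values are illustrative only
toy <- data.frame(
  income = c(0, 1, 0, 1, 1, 1, 1, 0),
  age    = c(25, 47, 31, 52, 38, 29, 60, 41),
  race   = c("A", "B", "C", "A", "B", "C", "A", "B")
)
toy$race <- as.factor(toy$race)

# Design matrix: one dummy column per non-reference level of race
X <- model.matrix(income ~ age + race, data = toy)
colnames(X)

# glm() applies the same encoding when fitting
fit <- glm(income ~ age + race, data = toy, family = "binomial")
names(coef(fit))
```

With the default treatment contrasts, the first level ("A" here) is the reference and gets no column of its own, which avoids perfect collinearity with the intercept.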

Answer 1: (score: 1)

Using data.table and mltools

df <- as.data.table(df)
df_oh <- one_hot(df)

Result and explanation

head(df_oh)
   age education_level marital_status_Divorced marital_status_Married marital_status_Never marital_status_Widowed occupation_Admin occupation_Banking occupation_Farming occupation_Fishing occupation_Poledancing gender_Man gender_Unicorn gender_Woman    hours income_<=50K income_>50K
1:  26              12                       0                      0                    0                      1                0                  0                  0                  0                      1          0              0            1 39.69357            0           1
2:  70              12                       0                      0                    0                      1                0                  0                  0                  0                      1          1              0            0 39.35318            0           1
3:  21              14                       1                      0                    0                      0                1                  0                  0                  0                      0          0              0            1 40.72573            1           0
4:  56               1                       0                      1                    0                      0                0                  1                  0                  0                      0          1              0            0 39.04525            0           1
5:  81               2                       0                      0                    0                      1                0                  0                  1                  0                      0          0              1            0 39.21665            1           0
6:  38               5                       0                      0                    0                      1                1                  0                  0                  0                      0          1              0            0 39.94481            1           0

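one_hot() takes all factor variables of a data.table (that is, columns that are not numeric, not character, and so on) and one-hot encodes them. It requires a data.table rather than a data.frame, because data.table offers features that help with flexibility and speed.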
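For comparison, a rough base-R sketch of the same idea. This is not how one_hot() is implemented; it only illustrates what a full one-hot encoding of a single factor looks like, and the toy data is hypothetical:

```r
# Hypothetical toy data; names and values are illustrative only
toy <- data.frame(
  age    = c(26, 70, 21),
  gender = factor(c("Woman", "Man", "Unicorn"))
)

# "- 1" drops the intercept, so every level of the factor gets its own 0/1 column
gender_oh <- model.matrix(~ gender - 1, data = toy)

# Put the numeric column back next to the indicator columns
toy_oh <- cbind(toy["age"], gender_oh)
toy_oh
```

The "- 1" trick only yields a full set of indicator columns for the first factor in a formula, so with several factors each one would need to be expanded separately.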

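If you look at the documentation under ?one_hot, you will see that the function also handles NAs well (if that is an issue in your data).

If you have any questions, feel free to add a comment.

Reproduction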

# Load libraries
library(data.table)
library(mltools)

# Set seed for reproducibility
set.seed(1701)

# Create mock data frame
df <- data.frame(
    age = sample(18:85, 50, replace = TRUE),
    education_level = sample(1:15, 50, replace = TRUE),
    marital_status = sample(c("Never", "Married", "Divorced", "Widowed"), 50, replace = TRUE),
    occupation = sample(c("Admin", "Farming", "Poledancing", "Fishing", "Banking"), 50, replace = TRUE),
    gender = sample(c("Man", "Woman", "Unicorn"), 50, replace = TRUE),
    hours = rnorm(50, 40, 1),
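    income = sample(c("<=50K", ">50K"), 50, replace = TRUE),
    stringsAsFactors = TRUE)  # R >= 4.0 keeps strings as character by default; one_hot() only encodes factors
Which results in: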
> head(df)
  age education_level marital_status  occupation  gender    hours income
1  26              12        Widowed Poledancing   Woman 39.69357   >50K
2  70              12        Widowed Poledancing     Man 39.35318   >50K
3  21              14       Divorced       Admin   Woman 40.72573  <=50K
4  56               1        Married     Banking     Man 39.04525   >50K
5  81               2        Widowed     Farming Unicorn 39.21665  <=50K
6  38               5        Widowed       Admin     Man 39.94481  <=50K

Answer 2: (score: 0)

Install the dummies library.

Example:

library(dummies)
# example data 
df1 <- data.frame(id = 1:4, year = 1991:1994)
df1 <- cbind(df1, dummy(df1$year, sep = "_"))

This generates dummy variables like the following:

df1
#   id year df1_1991 df1_1992 df1_1993 df1_1994
# 1  1 1991        1        0        0        0
# 2  2 1992        0        1        0        0
# 3  3 1993        0        0        1        0
# 4  4 1994        0        0        0        1