Create one-hot encoded columns while keeping the other features

Time: 2018-12-17 10:30:46

Tags: r one-hot-encoding

I have the following data:

dataset <- structure(list(id = structure(c(2L, 3L, 1L, 3L, 1L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "B", "A", "A", 
"A"), cat2 = c("item 1", "item 1", "item 2", "item 5", "item 3", 
"item 28"), cat3 = c("theme 2", "theme 2", "theme 1", "theme 4", 
"theme 10", "theme 40")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

I want to create a kind of model matrix with one-hot encoded columns built from the columns cat2 and cat3. My output would look like this:

structure(list(id = structure(c(1L, 1L, 2L, 3L, 3L, 9L), .Label = c("215101", 
"215559", "216566", "217284", "219435", "220209", "220249", "220250", 
"225678", "225679", "225687", "225869", "228420", "228435", "230621", 
"230623", "233063", "233097", "233098", "235546", "235560", "235567", 
"236379"), class = "factor"), cat1 = c("A", "B", "A", "A", "B", 
"A"), `item 1` = c(0, 0, 1, 0, 1, 0), `item 2` = c(0, 1, 0, 0, 
0, 0), `item 28` = c(0, 0, 0, 0, 0, 1), `item 3` = c(1, 0, 0, 
0, 0, 0), `item 5` = c(0, 0, 0, 1, 0, 0), `theme 1` = c(0, 1, 
0, 0, 0, 0), `theme 10` = c(1, 0, 0, 0, 0, 0), `theme 2` = c(0, 
0, 1, 0, 1, 0), `theme 4` = c(0, 0, 0, 1, 0, 0), `theme 40` = c(0, 
0, 0, 0, 0, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

However, this dataset has no independent variable, and I want to keep the id and cat1 columns. How can I do this?

1 answer:

Answer 0 (score: 1)

You can use dcast twice, once for cat2 and once for cat3, and merge the two results.

library(reshape2)

merge(dcast(dataset, id + cat1 ~ cat2, fun.aggregate = length),
      dcast(dataset, id + cat1 ~ cat3, fun.aggregate = length),
      by = c("id", "cat1"))
#      id cat1 item 1 item 2 item 28 item 3 item 5 theme 1 theme 10 theme 2 theme 4 theme 40
#1 215101    A      0      0       0      1      0       0        1       0       0        0
#2 215101    B      0      1       0      0      0       1        0       0       0        0
#3 215559    A      1      0       0      0      0       0        0       1       0        0
#4 216566    A      0      0       0      0      1       0        0       0       1        0
#5 216566    B      1      0       0      0      0       0        0       1       0        0
#6 225678    A      0      0       1      0      0       0        0       0       0        1

If you have more variables to spread, you could melt the data first. This saves you some typing.
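The code for the melt variant did not survive on this page, so the following is only a minimal sketch of what such an approach could look like, not necessarily the answerer's exact code. It assumes reshape2 is installed. The names `long` and `wide` are hypothetical, and the data frame is a simplified copy of the question's data, with the id factor reduced to the levels actually in use.

```r
library(reshape2)

# Simplified copy of the question's data (assumption: unused id levels dropped).
dataset <- data.frame(
  id   = factor(c("215559", "216566", "215101", "216566", "215101", "225678")),
  cat1 = c("A", "B", "B", "A", "A", "A"),
  cat2 = c("item 1", "item 1", "item 2", "item 5", "item 3", "item 28"),
  cat3 = c("theme 2", "theme 2", "theme 1", "theme 4", "theme 10", "theme 40"),
  stringsAsFactors = FALSE
)

# Stack cat2 and cat3 into a single long "value" column, keeping id and cat1 as keys.
long <- melt(dataset, id.vars = c("id", "cat1"))

# Spread every distinct value into its own column, counting occurrences per id/cat1 pair.
wide <- dcast(long, id + cat1 ~ value, fun.aggregate = length, value.var = "value")
wide
```

With this approach, adding another categorical column only requires that it be present in the data, since melt treats every column not listed in id.vars as a measure variable. One caveat: if the same label appeared in both cat2 and cat3, the two would be counted in a single column, so the cells would hold counts rather than strict 0/1 indicators.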