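I have a dataset in which all of my data is categorical, and I would like to one-hot encode it for further analysis.
The main problem I want to solve:
The data has 3 columns: Age, Info & Target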
mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"",
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age",
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
I would like to create a one-hot encoding for all of the variables shown above, so that the result looks like this:
Age_99 Age_10 Age_40 Age_15 good bad sad nice happy joy null okay nice fun wild go Boy Girl
1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1
Answer 0 (score: 2)
I think the following should work:
library(data.table)      ## as.data.table, melt, dcast (may not be attached by splitstackshape)
library(splitstackshape)
library(magrittr)
suppressWarnings({                               ## Just to silence melt
  mydf %>%                                       ## The dataset
    as.data.table(keep.rownames = TRUE) %>%      ## Convert to data.table
    .[, Info := gsub("c\\(|\"", "", Info)] %>%   ## Strip out c( and quotes
    cSplit("Info", ",") %>%                      ## Split the "Info" column
    melt(id.vars = "rn") %>%                     ## Melt everything except rn
    dcast(rn ~ value, fun.aggregate = length)    ## Go wide
})
# rn 10 15 40 99 Boy Girl NULL bad fun go good happy joy nice okay sad wild NA
# 1: 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 2
# 2: 2 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 2
# 3: 3 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 4
# 4: 4 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0
Here is the sample data I used:
mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"",
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age",
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
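Answer 1 (score: 0)
You can use the grepl function to scan each string for the value you are looking for, and use ifelse to fill the column accordingly.
Something like: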
# This creates a new column named 'good': 1 if the string contains "good", 0 if not
# (column names are case-sensitive; the sample data uses 'Info')
data$good = ifelse(grepl("good", data$Info), 1, 0)
# and do this for each variable of interest
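One caveat with substring matching, offered as a hedged aside: grepl("go", ...) also matches strings that contain "good", so short words can produce false positives. A minimal sketch, assuming strings shaped like the Info values in the question's sample data, that compares a plain substring match with a word-boundary match (perl = TRUE):

```r
# Two of the Info strings from the question's sample data
info <- c("c(\"good\", \"bad\", \"sad\"",
          "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\"")

# Plain substring match: "go" is also found inside "good"
loose <- grepl("go", info)

# Word-boundary match: only the standalone word "go" counts
strict <- grepl("\\bgo\\b", info, perl = TRUE)

# Indicator column built from the stricter match
go <- ifelse(strict, 1, 0)
```

Whether the stricter pattern is needed depends on the words in your data; for values such as "good" and "go" it makes a difference.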
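Finally, if you like, you can drop the Info column. This way you do not have to create any new data tables.
data$Info <- NULL
Note that you should change 'data' to whatever your dataset is actually called. As for Age, there is no need to convert it to a factor; just use: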
data$age99 = ifelse(data$Age == 99, 1,0) # and so forth for the other ages
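Writing one line per word and per age gets tedious, so the same idea can be looped. The following is only a sketch in base R, assuming the sample data from the question; the helper names (clean, tokens, words) and the Age_ prefix are illustrative choices, not part of either answer. It splits Info into words and tests exact membership with %in% rather than using grepl.

```r
# Sample data from the question
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c("c(\"good\", \"bad\", \"sad\"",
           "c(\"nice\", \"happy\", \"joy\"",
           "NULL",
           "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Strip the leftover "c(", ")" and quote characters, then split on commas
clean  <- gsub("c\\(|\\)|\"", "", mydf$Info)
tokens <- lapply(strsplit(clean, ","), trimws)

# One 0/1 column per distinct word, using exact matching per row
words <- unique(unlist(tokens))
for (w in words) {
  mydf[[w]] <- vapply(tokens, function(tk) as.integer(w %in% tk), integer(1))
}

# 0/1 columns for each Age value and each Target value
for (a in unique(mydf$Age)) {
  mydf[[paste0("Age_", a)]] <- as.integer(mydf$Age == a)
}
for (g in unique(mydf$Target)) {
  mydf[[g]] <- as.integer(mydf$Target == g)
}

# Drop the original text column once it has been encoded
mydf$Info <- NULL
```

The original Age and Target columns are left in place here; drop them the same way as Info if only the indicator columns are wanted. Note that the literal string "NULL" in row 3 becomes its own column.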