I am trying to one-hot encode the following character data frame in R.
x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- as.data.frame(rbind(x1,x2,x3))
I am trying to convert the data into the following format:
x1 <- c(0,0,0,0)
x2 <- c(1,1,0,0)
x3 <- c(0,1,1,1)
result <- as.data.frame(rbind(x1,x2,x3))
names(result) = c('A1','A2','A3','A4')
The delimiter is a comma, and I can split on it with the following commands:
test$V1 = as.character(test$V1)
split_list = strsplit(test$V1, ",")
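This gives me a list of character vectors, which cannot be coerced directly into a data frame. Is there a better way to do this? I have been trying OneHotEncoder.fit from the CatEncoders package (https://www.rdocumentation.org/packages/CatEncoders/versions/0.1.0/topics/OneHotEncoder.fit), but in this case the package spreads a single column rather than multiple columns.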
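For reference, the list returned by strsplit() can be turned into indicator columns with base R alone. The following is only a minimal sketch under the assumption that the target is exactly the 0/1 layout shown above; the names all_levels and onehot are illustrative and do not come from the question.

```r
# Rebuild the question's data; stringsAsFactors = FALSE keeps V1 as character
x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- as.data.frame(rbind(x1, x2, x3), stringsAsFactors = FALSE)

split_list <- strsplit(test$V1, ",")            # the empty string x1 becomes character(0)
all_levels <- sort(unique(unlist(split_list)))  # every distinct token, sorted

# Build one 0/1 row per original row: is each level among that row's tokens?
# (all_levels and onehot are illustrative names, not part of the question)
onehot <- do.call(rbind, lapply(split_list, function(tokens) {
  as.integer(all_levels %in% tokens)
}))
dimnames(onehot) <- list(rownames(test), all_levels)
onehot <- as.data.frame(onehot)
print(onehot)
```

Because the empty row splits to a zero-length vector, it matches no level and ends up as a row of zeros without any special handling.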
Answer 0 (score: 1)
A custom function that spreads the unique string values into columns:
x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- data.frame(col1 = rbind(x1, x2, x3), stringsAsFactors = FALSE) # test$col1 is a character column
cast_variables <- function(df, variable){
  df[df == ""] <- "missing" # handle missingness (empty strings)
  # split each row into trimmed tokens, so both "A1,A2" and "A1, A2" work
  tokens <- lapply(strsplit(as.character(df[[variable]]), ","), trimws)
  # setdiff() stays correct even when no "missing" value is present
  new_columns <- setdiff(sort(unique(unlist(tokens))), "missing")
  for (col in new_columns){
    # exact token match, so "A1" does not also match a label such as "A10"
    df[[col]] <- as.integer(vapply(tokens, function(tk) col %in% tk, logical(1)))
  }
  return(df)
}
test <- cast_variables(test, "col1")
print(test)
#        col1 A1 A2 A3 A4
# x1  missing  0  0  0  0
# x2    A1,A2  1  1  0  0
# x3 A2,A3,A4  0  1  1  1
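One detail worth checking with this kind of function is how a label is matched against each row. When labels share a prefix, substring matching and exact token matching give different indicators. The sketch below uses hypothetical labels (A10 does not appear in the question's data) purely to show the difference:

```r
# Hypothetical values: "A10" is an assumed label, not taken from the question
values <- c("A10", "A1,A2")

# Substring matching: "A1" is found inside "A10" as well
substring_hit <- grepl("A1", values, fixed = TRUE)

# Exact token matching: split first, then compare whole tokens
tokens <- strsplit(values, ",")
exact_hit <- vapply(tokens, function(tk) "A1" %in% tk, logical(1))

print(substring_hit)
print(exact_hit)
```

With the question's labels A1 to A4 both approaches agree, so this only matters once the label set grows.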
Answer 1 (score: 0)
Here is an approach using pipes:
library(dplyr)
library(tidyr)
library(reshape2)
library(data.table)
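test$V1 %>%
  strsplit(., ",") %>%
  setNames(row.names(test)) %>%
  reshape2::melt(value.name = 'variable') %>% # explicit namespace, because data.table masks melt()
  mutate(dummy = 1) %>%
  spread(key = variable, value = dummy) %>%
  # x1 splits to character(0) and drops out of melt(), so add it back as an otherwise empty row
  list(data.frame(L1 = rownames(test)[!rownames(test) %in% .[['L1']]]), .) %>%
  rbindlist(., use.names = TRUE, fill = TRUE) %>%
  mutate_all(~ replace(., is.na(.), 0)) # formula form, since funs() is deprecated in dplyr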
#   L1 A1 A2 A3 A4
# 1 x1  0  0  0  0
# 2 x2  1  1  0  0
# 3 x3  0  1  1  1
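The extra list()/rbindlist() step in the pipeline is needed because the empty row x1 contributes no tokens and disappears once the list is melted. A possible way to keep such rows without that step, sketched here in base R rather than with the packages above, is to carry the row names as a factor with fixed levels and cross-tabulate. The names row_id, counts and onehot_tab are illustrative assumptions.

```r
# Rebuild the question's data; stringsAsFactors = FALSE keeps V1 as character
x1 <- c('')
x2 <- c('A1,A2')
x3 <- c('A2,A3,A4')
test <- as.data.frame(rbind(x1, x2, x3), stringsAsFactors = FALSE)

# unique() per row so that a repeated token still yields 1, not a count of 2
split_list <- lapply(strsplit(test$V1, ","), unique)

# Repeat each row name once per token; the fixed factor levels keep x1
# in the table even though it has zero tokens
row_id <- factor(rep(rownames(test), lengths(split_list)),
                 levels = rownames(test))

counts <- table(row_id, unlist(split_list))  # rows: x1..x3, columns: tokens
onehot_tab <- as.data.frame.matrix(counts)   # wide data frame of 0/1 values
print(onehot_tab)
```

The design choice here is to let the factor levels, not a later join, guarantee that every original row appears in the result.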