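I have a data frame in which each row contains several comma-separated factors. Neither the total number of distinct factors nor the number of factors in each row is known in advance. I need to one-hot encode this column so that every unique factor gets its own column. I have a working solution, but I am sure there is a better, more elegant one. Here is an example: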
#one hot encode multiple factors in each row
library(stringr)
library(caret)
library(splitstackshape)
#create toy data frame
set.seed(123)
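factor.num <- sample(3:6,1) #number of distinct factors (also the maximum per row)
factors <- letters[sample(1:26,factor.num)] #draw factor.num distinct letters so later sampling cannot exceed the pool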
df1 <- data.frame(fact = replicate(100,paste(sample(factors,sample(1:factor.num,1)),collapse = ", ")))
df1
#split "fact" into unknown number of columns
df1_split <- cSplit(df1,"fact",",")
# convert all columns into dummy columns
dmy <- dummyVars(" ~ .", data = df1_split)
trsf <- data.frame(predict(dmy, newdata = df1_split))
#collect all columns with unique factors
final_df <- as.data.frame(matrix(0, ncol = factor.num, nrow = 100))
colnames(final_df) <- paste0("all_",factors)
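for (i in 1:factor.num) {
  #dummy columns whose name ends in ".<factor>", one per split position
  fac_cols <- colnames(trsf)[str_detect(colnames(trsf),paste0("(?<=\\.)",factors[i],"$"))]
  #drop = FALSE keeps a data frame even when only one column matches
  final_df[,paste0("all_",factors[i])] <- apply(trsf[,fac_cols,drop=FALSE],1,function(x) as.numeric(any(x==1,na.rm=TRUE)))
}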
final_df
Answer 0 (score: 2)
Here is a solution in 4 lines of code, with no loops. It may be too specific to your example data and might need adjusting for more general data.
# Get the values to apply a function over
values <- unique(trimws(unlist(strsplit(as.character(df1$fact), ','))))
# lapply over an anonymous function to return 0, 1 for presence of character
final_list <- lapply(values, function(x) as.integer(grepl(x,
                                                          as.character(df1$fact))))
# Format into data.frame
final_df2 <- as.data.frame(final_list)
colnames(final_df2) <- paste0('all_', values)
# Check to make sure the results are the same
diff_df <- final_df - final_df2[, names(final_df)]
summary(diff_df)
     all_u       all_k       all_v       all_x
 Min.   :0   Min.   :0   Min.   :0   Min.   :0
 1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0
 Median :0   Median :0   Median :0   Median :0
 Mean   :0   Mean   :0   Mean   :0   Mean   :0
 3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0
 Max.   :0   Max.   :0   Max.   :0   Max.   :0
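One way the approach above might be adjusted for more general data: grepl matches substrings, so a factor named "a" would also be flagged in a row that only contains "ab". The sketch below instead compares whole tokens after splitting. It is only an illustration in base R; the helper name one_hot_tokens, its sep and prefix arguments, and the toy data frame are hypothetical and not part of the original answer, and it assumes the separator is a literal string with no missing values in the column.

```r
# Hypothetical helper (not from the original answer): one-hot encode a
# column of delimiter-separated labels using exact token matching.
one_hot_tokens <- function(x, sep = ",", prefix = "all_") {
  # Split every row into trimmed tokens; fixed = TRUE treats sep literally
  tokens <- lapply(strsplit(as.character(x), sep, fixed = TRUE), trimws)
  # Distinct labels across all rows, dropping empty strings
  values <- unique(unlist(tokens))
  values <- values[nzchar(values)]
  # One integer column per label: 1 if the row contains it, otherwise 0
  out <- vapply(values,
                function(v) vapply(tokens,
                                   function(tk) as.integer(v %in% tk),
                                   integer(1)),
                integer(length(tokens)))
  out <- as.data.frame(matrix(out, nrow = length(tokens)))
  colnames(out) <- paste0(prefix, values)
  out
}

# Assumed toy input in which substring matching would confuse "a" with "ab"
toy <- data.frame(fact = c("a, ab", "ab", "b, a"))
encoded <- one_hot_tokens(toy$fact)
encoded
```

On this toy input the second row ("ab") gets 0 in all_a, whereas a grepl("a", ...) test would report 1 for it. Applied to the question's data it would be called as one_hot_tokens(df1$fact).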