One-hot encoding a data frame with multiple factors in each row

Time: 2017-11-25 16:22:32

Tags: r

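I have a data frame in which each row contains multiple comma-separated factors. Neither the total number of distinct factors nor the number of factors in each row is known in advance. I need to one-hot encode this column so that each unique factor gets its own column. I have a solution, but I am sure there is a better, more elegant one. Here is an example: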

#one hot encode multiple factors in each row
library(stringr)
library(caret)
library(splitstackshape) 

#create toy data frame
set.seed(123)
factor.num <- sample(3:6,1) #how many unique factors (also the max per row)
factors <- letters[sample(1:26,factor.num)]
df1 <- data.frame(fact = replicate(100,paste(sample(factors,sample(1:factor.num,1)),collapse = ", ")))
df1
#split "fact" into unknown number of columns
df1_split <- cSplit(df1,"fact",",")
# convert all columns into dummy columns
dmy <- dummyVars(" ~ .", data = df1_split)
trsf <- data.frame(predict(dmy, newdata = df1_split))
#collect all columns with unique factors
final_df <- as.data.frame(matrix(0, ncol = length(factors), nrow = nrow(df1)))
colnames(final_df) <- paste0("all_",factors)
for (i in seq_along(factors)) {
  fac_cols <- colnames(trsf)[str_detect(colnames(trsf),paste0("(?<=\\.)",factors[i],"$"))]
  # drop = FALSE keeps a data frame even when only one column matches
  final_df[,paste0("all_",factors[i])] <- apply(trsf[,fac_cols,drop=FALSE],1,function(x) as.numeric(any(x==1,na.rm=TRUE)))
}
final_df
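
To make the target shape concrete, here is a minimal sketch on a tiny made-up input. The objects `toy` and `wanted` and their values are hypothetical and only illustrate the kind of result described above; they are not taken from the data generated earlier.

```r
# Hypothetical toy input: each row holds a comma-separated set of factors
toy <- data.frame(fact = c("a, b", "b", "c, a"), stringsAsFactors = FALSE)

# Desired result, written out by hand: one 0/1 column per unique factor
wanted <- data.frame(all_a = c(1, 0, 1),
                     all_b = c(1, 1, 0),
                     all_c = c(0, 0, 1))

# Side-by-side view of the input and the encoding
cbind(toy, wanted)
```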

1 Answer:

Answer 0: (score: 2)

Here is a solution in 4 lines of code, without a loop. It may be too specific to your example data and may need adjusting for more general data.

# Get the values to apply a function over
values <- unique(trimws(unlist(strsplit(as.character(df1$fact), ','))))

# lapply over an anonymous function to return 0, 1 for presence of character
final_list <- lapply(values, function(x) as.integer(grepl(x, as.character(df1$fact))))

# Format into data.frame
final_df2 <- as.data.frame(final_list)
colnames(final_df2) <- paste0('all_', values)

# Check to make sure the results are the same
diff_df <- final_df - final_df2[, names(final_df)]
summary(diff_df)

     all_u       all_k       all_v       all_x  
 Min.   :0   Min.   :0   Min.   :0   Min.   :0  
 1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0  
 Median :0   Median :0   Median :0   Median :0  
 Mean   :0   Mean   :0   Mean   :0   Mean   :0  
 3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
 Max.   :0   Max.   :0   Max.   :0   Max.   :0
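
One way the approach above might need adjusting for more general data: `grepl` matches substrings, so a factor such as "b" would also be flagged in a row that only contains "ab". The following is a sketch of an exact-match variant using base R only. The helper name `one_hot_tokens` and its `sep` and `prefix` arguments are made up for this illustration and are not part of any package.

```r
# Sketch: exact-match one-hot encoding with base R only
# (one_hot_tokens is a hypothetical helper, not a library function)
one_hot_tokens <- function(x, sep = ",", prefix = "all_") {
  # Split every row on the literal separator and trim surrounding whitespace
  tokens <- lapply(strsplit(as.character(x), sep, fixed = TRUE), trimws)
  # Each distinct token becomes one output column
  values <- sort(unique(unlist(tokens)))
  # 1L if the token occurs in the row as a whole element, 0L otherwise
  cols <- lapply(values, function(v) {
    vapply(tokens, function(tk) as.integer(v %in% tk), integer(1))
  })
  names(cols) <- paste0(prefix, values)
  data.frame(cols, check.names = FALSE)
}

x <- c("ab, b", "b", "ab")
grepl("b", x)        # substring matching also flags the third row ("ab")
one_hot_tokens(x)    # all_b is 1, 1, 0 because tokens are compared whole
```

With the question's data this could be called as `one_hot_tokens(df1$fact)`; the columns come out in alphabetical order, so they would need reordering before comparing against `final_df`.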