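One-hot encoding: creating n-1 dummy variables

Date: 2017-04-24 15:05:13

Tags: r data.table one-hot-encoding

To one-hot encode the factor variables in my dataset, I am using the great function by user "Ben" from this post: How to one-hot-encode factor variables with data.table?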

one_hot <- function(dt, cols="auto", dropCols=TRUE, dropUnusedLevels=FALSE){
  # One-Hot-Encode unordered factors in a data.table
  # If cols = "auto", each unordered factor column in dt will be encoded. (Or specify a vector of column names to encode)
  # If dropCols=TRUE, the original factor columns are dropped
  # If dropUnusedLevels = TRUE, unused factor levels are dropped

  # Automatically get the unordered factor columns
  if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) & !is.ordered(x)))]

  # Build tempDT containing an ID column and the 'cols' columns
  tempDT <- dt[, cols, with=FALSE]
  tempDT[, ID := .I]
  setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
  for(col in cols) set(tempDT, j=col, value=factor(paste(col, tempDT[[col]], sep="_"), levels=paste(col, levels(tempDT[[col]]), sep="_")))

  # One-hot-encode
  if(dropUnusedLevels == TRUE){
    newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
  } else{
    newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
  }

  # Combine binarized columns with the original dataset
  result <- cbind(dt, newCols[, !"ID"])

  # If dropCols = TRUE, remove the original factor columns
  if(dropCols == TRUE){
    result <- result[, !cols, with=FALSE]
  }

  return(result)
}

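The function creates n dummy variables for the n levels of each factor column. Since I want to use the data for modeling, I would like only n-1 dummy variables per factor column. Is this possible? If so, how can I do it with this function?

As far as I can tell, this is the line that has to be adjusted: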

newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)

Here is the input table...

   ID color   size
1:  1 black  large
2:  2 green medium
3:  3   red  small

library(data.table)
DT = setDT(structure(list(ID = 1:3, color = c("black", "green", "red"), 
    size = c("large", "medium", "small")), .Names = c("ID", "color", 
"size"), row.names = c(NA, -3L), class = "data.frame"))

...and the desired output table:

ID color.black color.green size.large size.medium
 1           1           0          1           0
 2           0           1          0           1
 3           0           0          0           0

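1 answer:

Answer 0: (score: 3)

Here is a solution that performs full-rank dummy encoding (i.e., it creates n-1 columns per factor to avoid collinearity):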

library(caret)
data.table(ID = DT$ID, predict(dummyVars(ID ~ ., DT, fullRank = TRUE), DT))

This does exactly the job:

   ID colorgreen colorred sizemedium sizesmall
1:  1          0        0          0         0
2:  2          1        0          1         0
3:  3          0        1          0         1

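See this for a friendly walkthrough of the function, and ?dummyVars for all available options.

Also: in the comments, the OP mentioned needing to do this on millions of rows and thousands of columns, which is why data.table is required. If this simple preprocessing step is too much for the "machine", then I'm afraid the modeling step (i.e., the real deal) is doomed to fail.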
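
For comparison, the same n-1 encoding can be sketched in base R with model.matrix(), which applies treatment contrasts to factors by default. This is only an illustration on the example data, not part of the caret approach above; the names DF, mm, dummies, and result are made up for the sketch.

```r
# Assumption: the example data is rebuilt here as a plain data.frame with factor columns.
DF <- data.frame(
  ID    = 1:3,
  color = factor(c("black", "green", "red")),
  size  = factor(c("large", "medium", "small"))
)

# model.matrix() encodes each factor with n-1 indicator columns and
# adds an "(Intercept)" column, which is dropped here.
mm <- model.matrix(~ color + size, data = DF)
dummies <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

# Put the ID back in front of the indicator columns.
result <- cbind(ID = DF$ID, dummies)
print(result)
```

As with fullRank = TRUE, the first level of each factor (black, large) acts as the baseline and gets no column of its own.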
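
If staying within data.table matters, a rough sketch of an n-1 encoder could look like the following. The helper name one_hot_n1 and its arguments are hypothetical (they are not from the original post or from any package), and the sketch has not been benchmarked on large data. It treats the first level of each column as the baseline and skips it.

```r
library(data.table)

# Hypothetical helper: adds n-1 indicator columns for each column in `cols`,
# using the first factor level as the baseline (no column is created for it).
one_hot_n1 <- function(dt, cols, dropCols = TRUE) {
  result <- copy(dt)  # work on a copy so the caller's table is not modified by reference
  for (col in cols) {
    f <- as.factor(result[[col]])
    # Skip the first level: a row of all zeros then means "baseline level".
    for (lev in levels(f)[-1]) {
      set(result, j = paste(col, lev, sep = "_"), value = as.integer(f == lev))
    }
    # Optionally remove the original column.
    if (dropCols) set(result, j = col, value = NULL)
  }
  result
}

DT <- data.table(ID = 1:3,
                 color = c("black", "green", "red"),
                 size  = c("large", "medium", "small"))
res <- one_hot_n1(DT, cols = c("color", "size"))
print(res)
```

The desired output in the question drops the last level rather than the first; replacing levels(f)[-1] with head(levels(f), -1) in the sketch would give that variant.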