Question

I have a data.table with several categorical variables that I would like to turn into contrast (or "dummy") variables, along with many more numeric variables that I would like to simply pass through by reference.

Example dataset:

library('data.table')
d <- data.table(1:3,          # there are lots of numerics, so I want to avoid copying
                letters[1:3], # convert these to factor then dummy variable
                10:12, 
                LETTERS[24:26])
# >d
#    V1 V2 V3 V4
# 1:  1  a 10  X
# 2:  2  b 11  Y
# 3:  3  c 12  Z

The desired result looks like:

>dummyDT(d)
    V1 V3 V2.b V2.c V4.Y V4.Z
 1:  1 10    0    0    0    0
 2:  2 11    1    0    1    0
 3:  3 12    0    1    0    1

which can be produced with:

# this does what I want but is slow and inelegant and not idiomatic data.table
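categorToMatrix <- function(x, name_prefix='Var'){
  # set levels in order of appearance to avoid default re-sort by alpha
  f <- factor(x, levels=unique(x))
  # contrasts() has one row per level, so index by level code to get one row per observation
  m <- contrasts(f)[as.integer(f), , drop=FALSE]
  dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
  m
}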
dummyDT <- function(d){
  toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
  if(length(toDummy)>0){
    dummyComponent <- 
      data.table(
        do.call(cbind, lapply(toDummy, function(j) {
            categorToMatrix(d[[j]], name_prefix = names(d)[j])
          } )
         )
      )
    asIs <- (1:ncol(d))[-toDummy]
    if(length(asIs)>0) {
      allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
    } else allCols <- dummyComponent
  } else allCols <- d
  return(allCols)
}

(I do not care about maintaining original column ordering.)

In addition to the above, I have tried splitting each matrix into a list of columns, as in:

# split a matrix into list of columns and keep track of column names
# expanded from @Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
  setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}

# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
# 
# $V2.c
# [1] 0 0 1

This works for an individual column, but when I lapply it over multiple columns, the lists somehow end up as rows of comma-separated values and get recycled, which baffles me:

dummyDT2 <- function(d){
  stopifnot(inherits(d,'data.table'))

  toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))

  if(length(toDummy)>0){
     dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) , 
                                  .SDcols=toDummy]
     asIs <- (1:ncol(d))[-toDummy]
     if(length(asIs)>0) {
        allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
     } else allCols <- dummyComponent
  } else allCols <- d
  return(allCols)
}
dummyDT2(d)
#    V1 V3    V2
# 1:  1 10 0,1,0
# 2:  2 11 0,0,1
# 3:  3 12 0,1,0
# Warning message:
#   In data.table::data.table(...) :
#   Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)

I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
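
One possible reading of the recycling seen in dummyDT2: each call inside lapply(.SD, ...) returns a list of columns, so j receives a list of lists, and each inner list appears to be treated as a single list column whose length is the number of dummy columns rather than the number of rows. A minimal sketch of one way around this, assuming at least one categorical column, is to flatten the result by one level and assign the new columns by reference. The helper name contrastCols is hypothetical and not part of data.table.

```r
library(data.table)

d <- data.table(V1 = 1:3, V2 = letters[1:3], V3 = 10:12, V4 = LETTERS[24:26])

# Hypothetical helper (an assumption, not from the question): expand one
# categorical vector into a named list of n-1 contrast columns.
contrastCols <- function(x, prefix) {
  f <- factor(x, levels = unique(x))                # keep order of appearance
  m <- contrasts(f)[as.integer(f), , drop = FALSE]  # one row per observation
  setNames(lapply(seq_len(ncol(m)), function(j) unname(m[, j])),
           paste(prefix, colnames(m), sep = "."))
}

# Names of the columns to expand
toDummy <- names(d)[vapply(d, function(x) is.factor(x) || is.character(x),
                           logical(1))]

if (length(toDummy) > 0) {
  # Flatten one level so the result is a flat list of atomic vectors,
  # not a list of lists
  flat <- unlist(lapply(toDummy, function(nm) contrastCols(d[[nm]], nm)),
                 recursive = FALSE)
  d[, (names(flat)) := flat]   # add the contrast columns by reference
  d[, (toDummy) := NULL]       # drop the original categorical columns
}
```

Because both steps use :=, the numeric columns V1 and V3 are left in place rather than copied; the trade-off is that d itself is modified, so take a copy(d) first if the original must be preserved.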

I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table-optimized version.

Closely related question: How to one-hot-encode factor variables with data.table?

But there are two differences. First, I do not want a full set of dummy variables (one per level), because I am using this for regression; I want contrast variables instead (n-1 of these for n levels). Second, I have multiple numeric variables that I do not want to one-hot-encode.
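
To make the distinction concrete, here is a small sketch using base R's model.matrix on a toy factor x (an illustrative variable, not part of the dataset above). It contrasts a one-hot encoding with n columns against treatment contrasts with n-1 columns.

```r
x <- factor(c("a", "b", "c", "b"))

# One-hot: one indicator per level (n columns); "- 1" removes the intercept
oneHot <- model.matrix(~ x - 1)

# Treatment contrasts: the first level is the reference, leaving n-1 columns
# once the intercept column is dropped
contr <- model.matrix(~ x)[, -1, drop = FALSE]

colnames(oneHot)
colnames(contr)

# With an intercept, the one-hot columns are linearly dependent:
# 4 columns but rank 3
qr(cbind(1, oneHot))$rank
```

The rank check is the reason the n-1 form is preferred for regression with an intercept: the one-hot indicators sum to the intercept column.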

Return multiple results of column-to-matrix operations within a data.table

0 answers: