Fast crosstabs and statistics on all pairs of variables

Time: 2015-11-02 21:54:37

Tags: r data.table plyr

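I am trying to calculate a measure of association between all pairs of variables in a data.table. (This is not really a statistics question, but as an aside: the variables are all factors, and the measure is Cramér's V.)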
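For reference, Cramér's V for an r × k contingency table is usually written as below. The symbol names are chosen here for illustration and do not come from the question: n_ij is the count in cell (i, j), n_i· and n_·j are the row and column totals, and n is the number of complete observations.

```latex
\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{k}
         \frac{\left(n_{ij} - n_{i\cdot}\,n_{\cdot j}/n\right)^2}{n_{i\cdot}\,n_{\cdot j}/n},
\qquad
V = \sqrt{\frac{\chi^2}{n\,\min(r-1,\;k-1)}}
\]
```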

Example dataset:

library(data.table)
p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table( 
         data.frame(
           cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2),
                  matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2) ),
         stringsAsFactors=TRUE ) )

I am currently using a split-apply-combine approach that loops over all pairs of column indices (via plyr::adply) and returns one row per pair. (I tried to parallelize adply but failed.)

# Calculate Cramér's V between all variables -- my kludgey approach

pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df

# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
        complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
        chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
                                    formula = paste( names(obs)[ix], collapse='~'), 
                                    value.var = names(obs)[ix][1], # arbitrary
                                    fun.aggregate=length)[,-1, with=FALSE] )
        return(data.table(index_1 = ix[1],
                          var_1 =  names(obs)[ix][1],
                          index_2 = ix[2],
                          var_2 =  names(obs)[ix][2],
                          cramers_v = sqrt(chsq$statistic / 
                                             (sum(chsq$observed) *
                                                (pmin(nrow(chsq$observed),
                                                      ncol(chsq$observed) ) -1  ) )
                          ) ) 
        )
      })[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) : 
                                       # task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table   
# stopCluster(cl)
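The error quoted in the comment above suggests that the worker processes never received obs. Here is a minimal sketch of one way around that. It uses the base parallel package instead of plyr's .parallel and exports the data to each worker explicitly with clusterExport. The names toy, toy_pairs and pair_v are hypothetical stand-ins, and a plain data.frame with table() replaces the dcast step, so this is an illustration under those assumptions rather than a drop-in replacement.

```r
library(parallel)

# Small stand-in dataset (hypothetical); a plain data.frame keeps the workers free of extra packages
set.seed(1)
toy <- data.frame(
  a = sample(c(LETTERS[1:4], NA), 2000, replace = TRUE),
  b = sample(c(letters[1:6], NA), 2000, replace = TRUE),
  c = sample(c("u", "v", "w"), 2000, replace = TRUE),
  stringsAsFactors = TRUE
)
toy_pairs <- t(combn(ncol(toy), 2))  # one row per pair of column indices

# Cramér's V for one pair of columns; table() drops rows with NA in either column
pair_v <- function(ix) {
  tab  <- table(toy[[ix[1]]], toy[[ix[2]]])
  chsq <- suppressWarnings(chisq.test(tab, correct = FALSE))
  unname(sqrt(chsq$statistic / (sum(tab) * (min(dim(tab)) - 1))))
}

cl <- makeCluster(2)
clusterExport(cl, "toy")                      # make the data visible on every worker
v_par <- parApply(cl, toy_pairs, 1, pair_v)   # one V per row of toy_pairs
stopCluster(cl)

# Sequential reference for comparison
v_seq <- apply(toy_pairs, 1, pair_v)
```

plyr's documentation also describes a .paropts argument for passing .export and .packages through to foreach, which may resolve the same error inside adply. That route is not tested here.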

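What are my options for speeding up this calculation? My challenge is translating the row-wise operation on pairs into column-wise calculations on obs. I wonder whether it is possible to generate the column pairs directly in J, but the Force is not yet strong enough with this data.table padawan.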

1 answer:

Answer 0: (score: 1)

First, I would switch to a 'long' data format, as follows:

obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')

Next, set the key on the data table: setkeyv(mobs, 'id').
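To make the long format and the role of the id key concrete, here is a small sketch on a hypothetical three-row table (toy, long and joined are illustrative names, not from the answer). Joining the long table to the rows of a single variable on id pairs that variable's value with every variable's value for the same observation, which is what a per-pair contingency count needs.

```r
library(data.table)

toy <- data.table(id = 1:3,
                  X1 = c("A", "B", "A"),
                  X2 = c("a", "a", "b"))

long <- melt(toy, id.vars = "id")   # columns: id, variable, value
setkeyv(long, "id")

# Join every row of the long table to the X1 rows that share its id.
# Columns coming from the X1 side are prefixed with "i.".
joined <- long[long[variable == "X1"], on = "id"]

# Cell counts for each (variable, value, X1 value) combination
counts <- joined[, .N, by = list(variable, value, i.value)]
```

The join is written with an explicit on = "id" here so that it does not depend on which table's key is used; the answer's code relies on the key set by setkeyv instead.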

Finally, iterate over the variables and do the calculations on the pairs:

out <- list()
for(i in 1:p) {
  vari <- paste0('X', i)
  tmp <- mobs[mobs[variable == vari]]
  nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)]
  cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
  setkeyv(cj, c('variable', 'value', 'i.value'))
  nn <- nn[cj]
  nn[is.na(nij), nij := 0]
  nn[, ni := sum(nij), by = list(variable, i.value)]
  nn[, nj := sum(nij), by = list(variable, value)]
  nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
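  # Cramér's V = sqrt(chi^2 / (n * min(k - 1, r - 1))); the min() term belongs inside the square root
  out[[i]] <- nn[, list(i.variable = vari, cramers_v = sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) / (n[1] * min(k[1] - 1, r[1] - 1)))), by = variable]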
}
out <- rbindlist(out)

This way you only need to iterate over the variables once. As you can see, I also avoid chisq.test and write the calculation out myself.
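As a sanity check on a hand-written calculation, the statistic can be computed from a plain contingency table and compared with chisq.test. This is a minimal sketch in base R; cramers_v_manual and the simulated vectors x and y are hypothetical names used only for illustration.

```r
# Cramér's V from a contingency table, without chisq.test
cramers_v_manual <- function(tab) {
  n        <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n   # expected counts under independence
  chi2     <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

set.seed(42)
x   <- sample(LETTERS[1:4], 500, replace = TRUE)
y   <- sample(letters[1:6], 500, replace = TRUE)
tab <- table(x, y)

# Reference value built from chisq.test, using the same formula as the question
chsq  <- suppressWarnings(chisq.test(tab, correct = FALSE))
v_ref <- unname(sqrt(chsq$statistic / (sum(tab) * (min(dim(tab)) - 1))))

v_manual <- cramers_v_manual(tab)
```

If the two values agree on a few small tables, the same arithmetic can be trusted inside the grouped data.table expression.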