从许多响应中获得线性,泊松和负二项式回归的残差的最快方法

时间:2017-11-08 02:48:32

标签: r regression apply glm lm

我有一个数字,一个计数和一个过分散的计数大矩阵:

set.seed(1)
numeric.mat <- matrix(rnorm(10000*6000),10000,6000)
count.mat <- matrix(rpois(10000*6000,10),10000,6000)
dispersed.count.mat <- matrix(rnegbin(10000*6000,10,2),10000,6000)

还有一个相应的因素data.frame(也可以是matrix):

factors.df <- data.frame(f1 = sample(LETTERS[1:3], 10000, replace = T), 
                         f2 = sample(LETTERS[4:5], 10000, replace = T))

因子的数量非常少(在这种情况下只有2但实际数据不会超过5),每个因素的数量(它们都是绝对的)也很小(也是5)。

我想获得residuals,以便将linearpoissonnegative binomial回归模型分别拟合到每个矩阵的每个列中

所以对于一个专栏:

data.df <- factors.df %>% 
    dplyr::mutate(numeric.y = numeric.mat[,1], 
                  count.y = count.mat[,1], 
                  dispersed.count.y = dispersed.count.mat[,1])

我会用:

lm(numeric.y ~ f1+f2, data = data.df)$residuals
residuals(object = glm(count.y ~ f1+f2, data = data.df, family = "poisson"), type = 'pearson')
residuals(object = glm.nb(formula = model.formula, data = regression.df), type = 'pearson')

对于三种回归模型。

除了(例如,使用do.call)之外,是否有更快的方法来获取这些残差。 E.g:

do.call(cbind, 
        lapply(1:ncol(numeric.mat), 
               function(i)
                   lm(numeric.y ~ f1+f2, 
                      data = dplyr::mutate(factors.df, 
                                           numeric.y = numeric.mat[,i])
                   )$residuals
))

1 个答案:

答案 0 :(得分:3)

我稍微调整了工作流程的运行方式,并允许它轻松地并行运行。

# Use variables to adjust models, makes it easier to change sizes
iter <- 60
iter_samps <- 1000

factors_df <- data.frame(f1 = sample(LETTERS[1:3], iter_samps, replace = T), 
                         f2 = sample(LETTERS[4:5], iter_samps, replace = T)) 

# using a data.frame in a longer format to hold the data, allows easier splitting
data_df <- rep(list(factors_df), iter) %>% 
  bind_rows(.id = "id") %>%
  mutate(numeric_y = rnorm(iter_samps * iter),
         count_y = rpois(iter_samps * iter, 10),
         dispersed_count_y = MASS::rnegbin(iter_samps * iter, 10, 2))

# creating function that determines residuals
model_residuals <- function(data) {
  data$lm_resid <- lm(numeric_y ~ f1+f2, data = data)$residuals
  data$glm_resid <- residuals(object = glm(count_y ~ f1+f2, data = data, family = "poisson"), type = 'pearson')
  return(data)
}
# How to run the models not in parallel 
data_df %>%
  split(.$id) %>%
  map(model_residuals) %>%
  bind_rows()

要并行运行模型,您可以使用multidplyr执行所有烦人的工作

library("multidplyr")
test = data_df %>%
  partition(id) %>%
  cluster_library("tidyverse") %>%
  cluster_library("MASS") %>%
  cluster_assign_value("model_residuals", model_residuals) %>%
  do(results = model_residuals(.)) %>%
  collect() %>%
  .$results %>%
  bind_rows()