我有一个数字,一个计数和一个过分散的计数大矩阵:
set.seed(1)
numeric.mat <- matrix(rnorm(10000*6000),10000,6000)
count.mat <- matrix(rpois(10000*6000,10),10000,6000)
dispersed.count.mat <- matrix(rnegbin(10000*6000,10,2),10000,6000)
还有一个相应的因素data.frame
(也可以是matrix
):
factors.df <- data.frame(f1 = sample(LETTERS[1:3], 10000, replace = T),
f2 = sample(LETTERS[4:5], 10000, replace = T))
因子的数量非常少(在这种情况下只有2但实际数据不会超过5),每个因素的数量(它们都是绝对的)也很小(也是5)。
我想获得residuals
,以便将linear
,poisson
和negative binomial
回归模型分别拟合到每个矩阵的每个列中
所以对于一个专栏:
data.df <- factors.df %>%
dplyr::mutate(numeric.y = numeric.mat[,1],
count.y = count.mat[,1],
dispersed.count.y = dispersed.count.mat[,1])
我会用:
lm(numeric.y ~ f1+f2, data = data.df)$residuals
residuals(object = glm(count.y ~ f1+f2, data = data.df, family = "poisson"), type = 'pearson')
residuals(object = glm.nb(formula = model.formula, data = regression.df), type = 'pearson')
对于三种回归模型。
除了(例如,使用do.call)之外,是否有更快的方法来获取这些残差。 E.g:
do.call(cbind,
lapply(1:ncol(numeric.mat),
function(i)
lm(numeric.y ~ f1+f2,
data = dplyr::mutate(factors.df,
numeric.y = numeric.mat[,i])
)$residuals
))
答案 0 :(得分:3)
我稍微调整了工作流程的运行方式,并允许它轻松地并行运行。
# Use variables to adjust models, makes it easier to change sizes
iter <- 60
iter_samps <- 1000
factors_df <- data.frame(f1 = sample(LETTERS[1:3], iter_samps, replace = T),
f2 = sample(LETTERS[4:5], iter_samps, replace = T))
# using a data.frame in a longer format to hold the data, allows easier splitting
data_df <- rep(list(factors_df), iter) %>%
bind_rows(.id = "id") %>%
mutate(numeric_y = rnorm(iter_samps * iter),
count_y = rpois(iter_samps * iter, 10),
dispersed_count_y = MASS::rnegbin(iter_samps * iter, 10, 2))
# creating function that determines residuals
model_residuals <- function(data) {
data$lm_resid <- lm(numeric_y ~ f1+f2, data = data)$residuals
data$glm_resid <- residuals(object = glm(count_y ~ f1+f2, data = data, family = "poisson"), type = 'pearson')
return(data)
}
# How to run the models not in parallel
data_df %>%
split(.$id) %>%
map(model_residuals) %>%
bind_rows()
要并行运行模型,您可以使用multidplyr
执行所有烦人的工作
library("multidplyr")
test = data_df %>%
partition(id) %>%
cluster_library("tidyverse") %>%
cluster_library("MASS") %>%
cluster_assign_value("model_residuals", model_residuals) %>%
do(results = model_residuals(.)) %>%
collect() %>%
.$results %>%
bind_rows()