此问与答源自How to make group_by and lm fast?,其中OP试图针对大数据帧按组进行简单的线性回归。
从理论上讲,一系列分组回归y ~ x | g
等效于单个合并回归y ~ x * g
。后者非常吸引人,因为不同组之间的统计检验非常简单。但是实际上,进行这种较大的回归计算并不容易。我对链接的“问答”评论包speedlm
和glm4
的回答,但指出它们不能很好地解决此问题。
大型回归问题很困难,尤其是在存在因素变量的情况下。这可以解释为什么许多人放弃了这个想法,而是更喜欢按组划分数据并按组拟合模型。对于我而言,没有必要列举逐组回归的方法(请参见Linear Regression and group by in R)。我在乎的是速度。
对于像y ~ x | g
这样的简单线性回归,可以按组划分数据,然后依靠诸如lm
之类的标准模型拟合例程来杀死性能。首先,子集大数据帧效率低下。其次,标准模型拟合例程遵循以下过程,这对于有用的回归计算而言是巨大的开销。
terms.formula
); model.frame.default
); model.matrix.default
)。对于简单的线性回归,有巧妙的计算技巧。正如我在Fast pairwise simple linear regression between variables in a data frame中演示的那样,协方差方法非常快。我们可以通过group_by_simpleLM
函数通过简单的线性回归使它适应分组吗?
答案 0 :(得分:6)
我们必须通过编写编译后的代码来做到这一点。我会与Rcpp一起提出。请注意,我是C程序员,并且一直在使用R的常规C接口。 Rcpp只是用于简化列表,字符串和属性的处理,并有助于在R中立即进行测试。该代码主要以C风格编写。 R的常规C接口(例如REAL
和INTEGER
)仍然使用宏。有关“ group_by_simpleLM.cpp”的信息,请参见此答案的底部。
R包装函数group_by_simpleLM
具有四个参数:
group_by_simpleLM <- function (dat, LHS, RHS, group) {
##.... [TRUNCATED]
dat
是一个数据帧。如果您输入矩阵或列表,它将停止并抱怨。LHS
是一个字符向量,在~
的左侧给出了变量的名称。支持多个LHS变量。RHS
是一个字符向量,在~
的右侧给出了变量的名称。在简单的线性回归中,仅允许一个非因数的RHS变量。您可以向RHS
提供变量向量,但是该函数将仅保留第一个元素(带有警告)。如果在dat
中找不到该变量(可能是因为您输入了错误的名称),或者它不是数字变量,它将为您提供信息丰富的错误消息。group
是一个字符向量,给出了分组变量的名称。最好是dat
中的一个因素,否则该函数将match(group, unique(group))
用于快速强制并发出警告。水平未使用的因素无害。 group_by_simpleLM_cpp
看到了此情况,并返回了所有NaN
的级别。 group
可以为NULL
,以便对所有数据进行一次回归。主力函数group_by_simpleLM_cpp
返回矩阵的命名列表,以保存每个响应的回归结果。每个矩阵都是“宽”的,有nlevels(group)
列和5行:
For a simple linear regression, these five statistics are sufficient to obtain other statistics。
该函数注意组中只有一个基准的秩不足情况。无法估计斜率,然后返回NaN
。另一个特殊情况是组中只有两个数据。这样,拟合就完美了,您得到的斜率标准误差为0。
该函数在nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE)
时是group != NULL
的快速方法,在lm(RHS ~ LHS, dat)
时是group = NULL
的快速方法(甚至可能比general_paired_simpleLM快,它是用C编写的。
警告:
NA
中的NaN
/ Inf
/ -Inf
/ dat
,并且函数在存在时中断。library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")
## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
f = gl(2, 5, labels = letters[1:2]),
g = sample(gl(2, 5, labels = LETTERS[1:2])))
分组回归:一种nlme::lmList
group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
# a b
#alpha 0.820107094 -2.7164723
#beta -0.009796302 0.8812007
#beta.se 0.266690568 0.2090644
#r2 0.000449565 0.8555330
#df.resid 3.000000000 3.0000000
#
#$y2
# a b
#alpha 0.1304709 0.06996587
#beta -0.1616069 -0.14685953
#beta.se 0.2465047 0.24815024
#r2 0.1253142 0.10454374
#df.resid 3.0000000 3.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)
## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.820107094 0.8845125 0.92718537 0.4222195
#x -0.009796302 0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1304709 0.8175638 0.1595850 0.8833471
#x -0.1616069 0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142
处理等级缺陷而无故障
## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 NaN
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 NaN
## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 -1.53995
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 0.00000
多个分组变量
当我们有多个分组变量时,group_by_simpleLM
无法直接处理它们。但是您可以使用interaction
首先创建一个单因子变量。
dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
# a:A b:A a:B b:B
#alpha 1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta -0.2120782 0.9861509 0.7993313 0.4613999
#beta.se 0.0000000 0.2098876 0.4946167 0.0000000
#r2 1.0000000 0.9566642 0.7231188 1.0000000
#df.resid 0.0000000 1.0000000 1.0000000 0.0000000
#
#$y2
# a:A b:A a:B b:B
#alpha 1.0292956 -0.22746944 -1.5096975 0.06876360
#beta -0.2657021 -0.20650690 0.2547738 0.09172993
#beta.se 0.0000000 0.01945569 0.3483856 0.00000000
#r2 1.0000000 0.99120195 0.3484482 1.00000000
#df.resid 0.0000000 1.00000000 1.0000000 0.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)
## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.4750325 NaN NaN NaN
#x -0.2120782 NaN NaN NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0292956 NaN NaN NaN
#x -0.2657021 NaN NaN NaN
#
#$`Response y2`$r.squared
#[1] 1
不分组:lm
group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
# ALL
#alpha -0.9481826
#beta 0.4357022
#beta.se 0.2408162
#r2 0.2903691
#df.resid 8.0000000
#
#$y2
# ALL
#alpha 0.1002184
#beta -0.1542332
#beta.se 0.1514935
#r2 0.1147012
#df.resid 8.0000000
快速的大型简单线性回归
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
day = 3:7,
y1 = runif(nr),
y2 = rpois(nr, 3),
y3 = rnorm(nr),
y4 = rnorm(nr, 1, 5))
system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
# user system elapsed
# 0.200 0.016 0.219
library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
# user system elapsed
# 9.012 0.172 9.266
group_by_simpleLM
几乎立即执行所有4个响应,而glm4
仅9个响应就需要一个响应!
请注意,glm4
在排名不足的情况下可能会崩溃,而group_by_simpleLM
不会。
附录:“ group_by_simpleLM.cpp”
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {
/* number of data and number of responses */
int n = x.size(), k = Y.size(), n_groups = group_levels.size();
/* set up result list */
List result(k);
List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
int j; for (j = 0; j < k; j++) {
NumericMatrix mat(5, n_groups);
mat.attr("dimnames") = dimnames;
result[j] = mat;
}
result.attr("names") = Y.attr("names");
/* set up a vector to hold sample size for each group */
size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));
/*
compute group offset: cumsum(group_offset)
The offset is used in a different way when group is sorted or unsorted
In the former case, it is the offset to real x, y values;
In the latter case, it is the offset to ordering index indx
*/
int *u = INTEGER(group), *u_end = u + n, i;
if (n_groups > 1) {
while (u < u_end) group_offset[*u++]++;
for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
} else {
group_offset[1] = n;
group_unsorted = 0;
}
/* local variables & pointers */
double *xi, *xi_end; /* pointer to the 1st and the last x value */
double *yi; /* pointer to the first y value */
int gi; double inv_gi; /* sample size of the i-th group */
double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
double xiyi_cov; /* covariance between x and y values in the i-th group */
double beta, r2; int df_resi;
double *matij;
/* additional storage and variables when group is unsorted */
int *indx; double *xb, *xbi, dtmp;
if (group_unsorted) {
indx = (int *)malloc(n * sizeof(int));
xb = (double *)malloc(n * sizeof(double)); // buffer x for caching
R_orderVector1(indx, n, group, TRUE, FALSE); // Er, how is TRUE & FALSE recogonized as Rboolean?
}
/* loop through groups */
for (i = 0; i < n_groups; i++) {
/* set group size gi */
gi = group_offset[i + 1] - group_offset[i];
/* special case for a factor level with no data */
if (gi == 0) {
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = R_NaN;
}
continue;
}
/* rank-deficient case */
if (gi == 1) {
gi = group_offset[i];
if (group_unsorted) gi = indx[gi];
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = REAL(Y[j])[gi];
matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = 0.0;
}
continue;
}
/* general case where a regression line can be estimated */
inv_gi = 1 / (double)gi;
/* compute mean & variance of x values in this group */
xi_mean = 0.0; xi_var = 0.0;
if (group_unsorted) {
/* use u, u_end and xbi */
xi = REAL(x);
u = indx + group_offset[i]; /* offset acts on index */
u_end = u + gi;
xbi = xb + group_offset[i];
for (; u < u_end; xbi++, u++) {
dtmp = xi[*u];
xi_mean += dtmp;
xi_var += dtmp * dtmp;
*xbi = dtmp;
}
} else {
/* use xi and xi_end */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
xi_end = xi + gi;
for (; xi < xi_end; xi++) {
xi_mean += *xi;
xi_var += (*xi) * (*xi);
}
}
xi_mean = xi_mean * inv_gi;
xi_var = xi_var * inv_gi - xi_mean * xi_mean;
/* loop through responses doing simple linear regression */
for (j = 0; j < k; j++) {
/* compute mean & variance of y values, as well its covariance with x values */
yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
if (group_unsorted) {
xbi = xb + group_offset[i]; /* use buffered x values */
yi = REAL(Y[j]);
u = indx + group_offset[i]; /* offset acts on index */
for (; u < u_end; u++, xbi++) {
dtmp = yi[*u];
yi_mean += dtmp;
yi_var += dtmp * dtmp;
xiyi_cov += dtmp * (*xbi);
}
} else {
/* set xi and yi */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
yi = REAL(Y[j]) + group_offset[i]; /* offset acts on values */
for (; xi < xi_end; xi++, yi++) {
yi_mean += *yi;
yi_var += (*yi) * (*yi);
xiyi_cov += (*yi) * (*xi);
}
}
yi_mean = yi_mean * inv_gi;
yi_var = yi_var * inv_gi - yi_mean * yi_mean;
xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
/* locate the right place to write back regression result */
matij = REAL(result[j]) + i * 5 + 4;
/* residual degree of freedom */
df_resi = gi - 2; *matij-- = (double)df_resi;
/* R-squared = squared correlation */
r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
/* standard error of regression slope */
if (df_resi == 0) *matij-- = 0.0;
else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
/* regression slope */
beta = xiyi_cov / xi_var; *matij-- = beta;
/* regression intercept */
*matij = yi_mean - beta * xi_mean;
}
}
if (group_unsorted) {
free(indx);
free(xb);
}
free(group_offset);
return result;
}
/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
## basic input validation
if (missing(dat)) stop("no data provided to 'dat'!")
if (!is.data.frame(dat)) stop("'dat' must be a data frame!")
if (missing(LHS)) stop("no 'LHS' provided!")
if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")
if (missing(RHS)) stop("no 'RHS' provided!")
if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")
if (!is.null(group)) {
## grouping variable provided: a fast method of `nlme::lmList`
if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")
## ensure that group has length 1, is available in the data frame and is a factor
if (length(group) > 1L) {
warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
group <- group[1L]
}
grp <- dat[[group]]
if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))
if (is.factor(grp)) {
grp_levels <- levels(grp)
} else {
warning("grouping variable is not provided as a factor; fast coercion is made!")
grp_levels <- unique(grp)
grp <- match(grp, grp_levels)
grp_levels <- as.character(grp_levels)
}
grp_unsorted <- .Internal(is.unsorted(grp, FALSE))
} else {
## no grouping; a fast method of `lm`
grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE
}
## the RHS must has length 1, is available in the data frame and is numeric
if (length(RHS) > 1L) {
warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
RHS <- RHS[1L]
}
x <- dat[[RHS]]
if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
if (!is.numeric(x) || is.factor(x)) {
stop("RHS variable must be 'numeric' for simple linear regression!")
}
x < as.numeric(x) ## just in case that `x` is an integer
## check LHS variables
nested <- match(RHS, LHS, nomatch = 0L)
if (nested > 0L) {
warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
LHS <- LHS[-nested]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
missed <- !(LHS %in% names(dat))
if (any(missed)) {
warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
LHS <- LHS[!missed]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
Y <- dat[LHS]
invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
if (any(invalid_LHS)) {
warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
Y <- Y[!invalid_LHS]
}
if (length(Y) == 0L) stop("no usable LHS variables found!")
Y <- lapply(Y, as.numeric) ## just in case that we have integer variables in Y
## check for exsitence of NA, NaN, Inf and -Inf and drop them?
## use Rcpp
group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
}
*/