Question

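I would like to iterate over a list of linear models and apply "clustered" standard errors to each model using the vcovCL function. My goal is to do this as efficiently as possible (I am running a linear model across many columns of a data frame). My problem is specifying additional arguments inside the anonymous function. Below I simulate some fake data. Precincts represent my cross-sectional dimension; months represent my time dimension (5 units observed over 4 months). The variable int is a dummy for when an intervention takes place.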

df <- data.frame(
  precinct = c( rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4) ),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
  )

df[1:10, ]

outcome <- df[3]
est <- lapply(outcome, FUN = function(x) { lm(x ~ as.factor(precinct) + as.factor(month) + int, data = df) })

se <- lapply(est, function(x) { sqrt(diag(vcovCL(x, cluster = ~ precinct + month))) })

When I add the cluster argument inside the vcovCL function, I receive the following error message.

Error in eval(expr, envir, enclos) : object 'x' not found

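As far as I can tell, the only way around this is to index the data frame, i.e. df$, and then specify the "clustering" variables. Could this be achieved by specifying an additional argument for df inside the function call? Is this code efficient?

I thought that specifying the model equation as a formula might be a better approach.

Any thoughts/comments are always helpful :)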

Answer 1

Here is an approach that retrieves clustered standard errors for multiple models:

library(sandwich)

# I am going to use the same model three times to get the "sequence" of linear models. 
mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)

# define function to retrieve standard errors:
robust_se <- function(mod) {sqrt(diag(vcovCL(mod, cluster = list(df$precinct, df$month))))}

# apply function to all models:
se <- lapply(list(mod, mod, mod), robust_se)
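On the question of passing df as an additional argument: a minimal sketch, assuming the same toy data as in the question. The function name robust_se_data and its data parameter are hypothetical, introduced only for illustration; the vcovCL call itself is the same as above. Depending on the simulated data, multiway clustering may produce NaN warnings for some coefficients.

```r
library(sandwich)

# Toy panel mirroring the question: 5 precincts observed over 4 months.
set.seed(1)
df <- data.frame(
  precinct = rep(1:5, each = 4),
  month    = rep(1:4, times = 5),
  crime    = rnorm(20, 10, 5),
  int      = c(0, 1, 1, 0, rep(0, 8), 1, 1, 1, 0, rep(0, 4))
)

mod <- lm(crime ~ as.factor(precinct) + as.factor(month) + int, data = df)

# Hypothetical variant of robust_se(): the data frame is an explicit argument
# instead of being looked up in the global environment.
robust_se_data <- function(mod, data) {
  sqrt(diag(vcovCL(mod, cluster = list(data$precinct, data$month))))
}

# Arguments given after the function are forwarded by lapply() to every call.
se_list <- lapply(list(mod, mod), robust_se_data, data = df)
```

Because the data frame is a parameter, the same function can be reused with a different data set without editing its body.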

If you want to adjust the entire output, the following might be helpful:

library(lmtest)
adj_stats <- function(mod) {coeftest(mod, vcovCL(mod, cluster = list(df$precinct, df$month)))}

adjusted_models <- lapply(list(mod, mod, mod), adj_stats)

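To address the multiple-columns issue:

If you are struggling with running linear models over multiple columns, the following might be helpful. Everything above stays the same, except that you pass your list of models to lapply.

First, let's use this data frame: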

df <- data.frame(
  precinct = c( rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4) ),
  month = rep(1:4, 5),
  crime = rnorm(20, 10, 5),
  crime2 = rnorm(20, 10, 5),
  crime3 = rnorm(20, 10, 5),
  int = c(c(0, 1, 1, 0), rep(0, 4), rep(0, 4), c(1, 1, 1, 0), rep(0, 4))
)

Let's define the outcome columns:

outcome_columns <- c("crime", "crime2", "crime3")

Now, let's run a regression for each outcome:

models <- lapply(outcome_columns,
         function(outcome) lm(as.formula(paste0(outcome, " ~ as.factor(precinct) + as.factor(month) + int")), data = df))

Then you just need to call

adjusted_models <- lapply(models, adj_stats)
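As a variation on the model-fitting step above, here is a sketch that builds each formula with base R's reformulate() and keeps the outcome names on the resulting list. The helper fit_outcome and the vector rhs are hypothetical names used only for this illustration; the data mirror the data frame above.

```r
# Toy data with three outcome columns, as in the data frame above.
set.seed(1)
df <- data.frame(
  precinct = rep(1:5, each = 4),
  month    = rep(1:4, times = 5),
  crime    = rnorm(20, 10, 5),
  crime2   = rnorm(20, 10, 5),
  crime3   = rnorm(20, 10, 5),
  int      = c(0, 1, 1, 0, rep(0, 8), 1, 1, 1, 0, rep(0, 4))
)

outcome_columns <- c("crime", "crime2", "crime3")
rhs <- c("as.factor(precinct)", "as.factor(month)", "int")

# reformulate() assembles `outcome ~ rhs[1] + rhs[2] + ...` as a formula object.
fit_outcome <- function(outcome, data) {
  lm(reformulate(rhs, response = outcome), data = data)
}

# Naming the input vector makes lapply() return a named list of models.
models <- lapply(setNames(outcome_columns, outcome_columns), fit_outcome, data = df)
```

With a named list, results can be looked up by outcome, e.g. models$crime2, rather than by position.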

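Regarding efficiency:

The code above is efficient in the sense that it is easy to adapt and quick to write. For most use cases, this will be perfectly fine. For computational efficiency, note that your design matrix is the same in all cases, i.e. by precomputing the common elements (e.g. inv(X'X)*X'), you could save some computations. You would, however, lose the convenience of many built-in functions.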
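The idea of precomputing the shared part can be sketched in base R as follows. This is only an illustration under the assumption that every regression uses exactly the same right-hand side and has no missing values; the object names X, Y, XtX_inv_Xt and B are hypothetical.

```r
# Toy data with three outcome columns, as in the data frame above.
set.seed(1)
df <- data.frame(
  precinct = rep(1:5, each = 4),
  month    = rep(1:4, times = 5),
  crime    = rnorm(20, 10, 5),
  crime2   = rnorm(20, 10, 5),
  crime3   = rnorm(20, 10, 5),
  int      = c(0, 1, 1, 0, rep(0, 8), 1, 1, 1, 0, rep(0, 4))
)

# Design matrix shared by every regression
# (intercept, precinct dummies, month dummies, int).
X <- model.matrix(~ as.factor(precinct) + as.factor(month) + int, data = df)

# Outcomes side by side, one column per regression.
Y <- as.matrix(df[, c("crime", "crime2", "crime3")])

# (X'X)^{-1} X' is computed once and reused for all outcomes.
XtX_inv_Xt <- solve(crossprod(X), t(X))

# Each column of B holds the OLS coefficients for one outcome.
B <- XtX_inv_Xt %*% Y
```

The coefficients in each column of B agree with those from the corresponding lm() fit, but the clustered covariance would then have to be assembled by hand from the residuals Y - X %*% B, which is the convenience you give up.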

Applying a "cluster" function over a list of linear models
