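I have a panel data set in R (time and cross-section) and would like to compute standard errors clustered along two dimensions, because my residuals are correlated both ways. Googling around, I found http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/, which provides a function to do this. It seems a bit ad hoc, so I wanted to know whether there is a package that has been tested and does this?
I know sandwich does HAC standard errors, but it doesn't do double clustering (i.e. along two dimensions).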
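Answer 0: (score: 7)
This is an old question. But seeing as people still appear to be landing on it, I thought I'd provide some modern approaches to multiway clustering in R: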
fixest::feols()
library(fixest)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
est_feols = feols(ln_wage ~ age | race + year, data = nlswork)
## SEs will automatically be clustered by the first FE (i.e. race) in the above model
est_feols
## But we can instantaneously compute other SEs on the fly with summary.fixest()
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'white') ## robust SEs
summary(est_feols, se = 'twoway') ## twoway clustering
summary(est_feols, cluster = c('race', 'year')) ## same as the above
summary(est_feols, cluster = c('race', 'year', 'idcode')) ## add third cluster var (not in original model call)
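As a hedged aside, recent fixest releases also expose a vcov argument that takes either a keyword or a one-sided formula. The sketch below assumes a fixest version that has this argument, and it uses the trade data set shipped with the package instead of nlswork so that it runs without a download. The object names are illustrative only.

```r
library(fixest)

# Example data bundled with fixest (not the nlswork data used above)
data(trade, package = "fixest")

est <- feols(log(Euros) ~ log(dist_km) | Origin + Destination, data = trade)

# "twoway" clusters on the first two fixed effects of the model
se_keyword <- se(est, vcov = "twoway")

# The same clusters, spelled out as a one-sided formula
se_formula <- se(est, vcov = ~ Origin + Destination)

se_keyword
se_formula
```

The formula form is the more explicit of the two, since it does not depend on the order of the fixed effects in the model call.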
lfe::felm()
library(lfe)
## Unlike fixest::feols, here we specify the clusters in the actual model call.
## (Note the third "| 0 " slot means we're not using IV)
est_felm = felm(ln_wage ~ age | race + year | 0 | race + year + idcode, data = nlswork)
summary(est_felm)
sandwich::vcovCL()
library(sandwich)
library(lmtest)
est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork)
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Aaaand, just to belabour the point about speed, here is a benchmark of the three approaches (using two fixed effects and two-way clustering).
library(microbenchmark)
est_feols = function() {summary(feols(ln_wage ~ age | race + year, data = nlswork),
                                cluster = c('race', 'year'))}
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork),
                                     vcov = vcovCL, cluster = ~ race + year)}
microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> est_feols() 10.40799 10.54351 11.71474 10.67902 12.36811 14.05719 3 a
#> est_felm() 99.56899 108.89241 112.55856 118.21584 119.05334 119.89085 3 a
#> est_standwich() 190.30892 198.92584 245.12421 207.54276 272.53185 337.52095 3 b
Answer 1: (score: 6)
For panel regressions, the plm package can estimate clustered SEs along two dimensions.
Using M. Petersen's benchmark results:
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
  vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
    vcovHC(x, method="white1", ...)
}
fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
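The vcovDC helper above follows the usual two-way decomposition. Written out (the notation here is mine, not the answer's), it is:

```latex
\hat{V}_{\text{group,time}}
  = \hat{V}_{\text{group}} + \hat{V}_{\text{time}} - \hat{V}_{\text{white}}
```

Here \hat{V}_{\text{white}} is the heteroskedasticity-robust estimator. Subtracting it removes the observation-level terms that both one-way estimators count.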
Now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.022189 1.3376 0.1811
x 1.034833 0.031679 32.6666 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.064580 0.4596 0.6458
x 1.034833 0.052465 19.7243 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the above works only if your data can be coerced to a pdata.frame. It will fail if you have "duplicate couples (time-id)". In that case you can still cluster, but only along one dimension.
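A quick way to see whether a data set has such duplicate couples is to look for repeated (id, time) pairs before calling plm. This is a minimal base-R sketch on a made-up toy data frame. The column names mirror the Petersen data but the values are hypothetical.

```r
# Hypothetical toy data: firm 1 appears twice in year 2001
dat <- data.frame(
  firmid = c(1, 1, 1, 2, 2),
  year   = c(2000, 2001, 2001, 2000, 2001),
  y      = c(0.5, 1.2, 0.9, 0.3, 0.8)
)

# Flag every row whose (firmid, year) pair occurs more than once
idx <- dat[, c("firmid", "year")]
dup_rows <- duplicated(idx) | duplicated(idx, fromLast = TRUE)

dat[dup_rows, ]  # the offending rows
any(dup_rows)    # TRUE means the (id, time) pairs are not unique
```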
Trick plm into thinking that you have a proper panel data set by specifying only one index:
fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
Now you can obtain clustered SEs:
##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
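You can also use this workaround to cluster by a higher dimension or at a higher level (e.g. industry or country). In that case, however, you won't be able to use the group (or time) effects, which is the main limitation of the approach.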
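Another approach, which works for both panel and other types of data, is the multiwayvcov package. It allows double clustering, but also clustering along more dimensions. According to the package's website, it improves on Arai's code in two ways:
- Transparent handling of observations dropped due to missingness
- Full multi-way (or n-way, or n-dimensional, or multi-dimensional) clustering
Using the Petersen data and cluster.vcov():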
library("lmtest")
library("multiwayvcov")
data(petersen)
m1 <- lm(y ~ x, data = petersen)
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029680 0.065066 0.4561 0.6483
## x 1.034833 0.053561 19.3206 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Answer 2: (score: 5)
Arai's function can be used for clustering standard errors. He has another version for clustering in multiple dimensions:
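mcl <- function(dat, fm, cluster1, cluster2) {
  # attach() lets callers pass column names of dat unquoted
  attach(dat, warn.conflicts = FALSE)
  on.exit(detach(dat))  # take dat off the search path again on exit
  library(sandwich)
  library(lmtest)
  # Intersection of the two cluster variables. A separator avoids ambiguous
  # labels (with sep = "", 1 and 12 would collide with 11 and 2).
  cluster12 <- paste(cluster1, cluster2, sep = "_")
  M1 <- length(unique(cluster1))
  M2 <- length(unique(cluster2))
  M12 <- length(unique(cluster12))
  N <- length(cluster1)
  K <- fm$rank
  # Small-sample degrees-of-freedom corrections
  dfc1 <- (M1/(M1-1))*((N-1)/(N-K))
  dfc2 <- (M2/(M2-1))*((N-1)/(N-K))
  dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
  # Sum the score contributions within each cluster
  u1j <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum))
  u2j <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum))
  u12j <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
  vc1 <- dfc1*sandwich(fm, meat=crossprod(u1j)/N)
  vc2 <- dfc2*sandwich(fm, meat=crossprod(u2j)/N)
  vc12 <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
  # Two-way estimator: add the one-way pieces, subtract the intersection
  vcovMCL <- vc1 + vc2 - vc12
  coeftest(fm, vcovMCL)
}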
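To make the arithmetic inside mcl easier to follow, here is a dependency-free sketch of the same add-and-subtract idea in base R. It is not Arai's code. It runs on simulated data (hypothetical, not the Petersen data), it leaves out the small-sample corrections (dfc1, dfc2, dfc12) that mcl applies, and all names in it are made up for illustration.

```r
# Simulated balanced panel: 50 firms observed over 10 years (hypothetical data)
set.seed(1)
d <- expand.grid(firm = 1:50, year = 1:10)
firm_fx <- rnorm(50)[d$firm]   # shock shared within a firm
year_fx <- rnorm(10)[d$year]   # shock shared within a year
d$x <- rnorm(nrow(d)) + firm_fx
d$y <- 1 + 2 * d$x + firm_fx + year_fx + rnorm(nrow(d))

fit <- lm(y ~ x, data = d)
X <- model.matrix(fit)
scores <- X * resid(fit)       # row i is x_i * e_i
bread <- solve(crossprod(X))   # (X'X)^{-1}

# One-way cluster-robust covariance, without small-sample correction
cluster_vcov <- function(cl) {
  s <- rowsum(scores, cl)      # sum the scores within each cluster
  bread %*% crossprod(s) %*% bread
}

v_firm <- cluster_vcov(d$firm)
v_year <- cluster_vcov(d$year)
v_both <- cluster_vcov(interaction(d$firm, d$year, drop = TRUE))

# Two-way estimator: add the one-way pieces, subtract the intersection
v_twoway <- v_firm + v_year - v_both
sqrt(diag(v_twoway))
```

Because each firm-year cell holds a single observation here, v_both coincides with the plain heteroskedasticity-robust (White) estimator.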
Answer 3: (score: 4)
Frank Harrell's package rms (formerly named Design) has a function that I use often when clustering: robcov.
See this part of ?robcov, for example.
cluster: a variable indicating groupings. ‘cluster’ may be any type of
vector (factor, character, integer). NAs are not allowed.
Unique values of ‘cluster’ indicate possibly correlated
groupings of observations. Note the data used in the fit and
stored in ‘fit$x’ and ‘fit$y’ may have had observations
containing missing values deleted. It is assumed that if any
NAs were removed during the original model fitting, an
‘naresid’ function exists to restore NAs so that the rows of
the score matrix coincide with ‘cluster’. If ‘cluster’ is
omitted, it defaults to the integers 1,2,...,n to obtain the
"sandwich" robust covariance matrix estimate.
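A minimal usage sketch for robcov follows, assuming the rms package is installed. The data are simulated and the variable names are hypothetical. Note that robcov takes a single cluster variable, so on its own this gives one-way clustering, not the two-way clustering asked about in the question.

```r
library(rms)

# Simulated data (hypothetical): 40 firms, 5 observations each
set.seed(42)
n_firms <- 40
n_years <- 5
d <- data.frame(firm = rep(seq_len(n_firms), each = n_years))
d$x <- rnorm(nrow(d))
d$y <- 1 + 0.5 * d$x + rnorm(n_firms)[d$firm] + rnorm(nrow(d))

# robcov() needs the design matrix and response stored in the fit
fit <- ols(y ~ x, data = d, x = TRUE, y = TRUE)
fit_cl <- robcov(fit, cluster = d$firm)

sqrt(diag(fit$var))     # conventional standard errors
sqrt(diag(fit_cl$var))  # cluster-robust standard errors (by firm)
```

The returned object keeps the original covariance matrix in its orig.var component, so both sets of standard errors remain available.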