面板数据的双集群标准错误

时间:2011-12-05 18:12:42

标签: r regression standard-error panel-data plm

我在R(时间和横截面)中有一个面板数据集,并且想要计算由两个维度聚类的标准误差,因为我的残差是双向相关的。谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/提供了这样做的功能。它似乎有点特别,所以我想知道是否有一个已经过测试的包并且这样做了吗?

我知道sandwich会出现HAC标准错误,但它不会进行双重聚类(即沿着两个维度)。

4 个答案:

答案 0 :(得分:7)

这是一个古老的问题。但是,鉴于人们似乎仍在着陆,我想我会为R中的多路聚类提供一些现代方法:

选项1(最快):fixest::feols()

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, data = nlswork)

## SEs will automatically be clustered by the first FE (i.e. race) in the above model
est_feols

## But we can instantaneously compute other SEs on the fly with summary.fixest()
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'white') ## robust SEs
summary(est_feols, se = 'twoway') ## twoway clustering
summary(est_feols, cluster = c('race', 'year')) ## same as the above
summary(est_feols, cluster = c('race', 'year', 'idcode'))  ## add third cluster var (not in original model call)

选项2(快速):lfe::felm()

library(lfe)

## Unlike fixest::feols, here we specify the clusters in the actual model call.
## (Note the third "| 0 " slot means we're not using IV) 

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year + idcode, data = nlswork)
summary(est_felm)

选项3(速度较慢但灵活):三明治

library(sandwich)
library(lmtest)


est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)

基准

Aaaand,只是为了迷惑速度。这是三种不同方法(使用两个固定的FE和双向集群)的基准。

est_feols = function() {summary(feols(ln_wage ~ age | race + year, data = nlswork), 
                               cluster = c('race', 'year'))} 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  10.40799  10.54351  11.71474  10.67902  12.36811  14.05719     3  a 
#>       est_felm()  99.56899 108.89241 112.55856 118.21584 119.05334 119.89085     3  a 
#>  est_standwich() 190.30892 198.92584 245.12421 207.54276 272.53185 337.52095     3   b

答案 1 :(得分:6)

对于面板回归, plm 包可以估算两个维度的群集SE。

使用M. Petersen’s benchmark results

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))

现在您可以获得群集SE:

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

有关详细信息,请参阅:


但是,只有当您的数据可以强制转换为pdata.frame时,上述方法才有效。如果您有"duplicate couples (time-id)",它将失败。在这种情况下,您仍然可以聚类,但只能沿着一个维度聚类。

通过仅指定一个索引,诱使plm认为您拥有正确的面板数据集:

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))

现在您可以获得群集SE:

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

您还可以使用此变通方法通过更高维度更高级别(例如industrycountry)进行群集。但是,在这种情况下,您将无法使用group(或timeeffects,这是该方法的主要限制。


适用于面板和其他类型数据的另一种方法是 multiwayvcov 包。它允许双重聚类,但也可以在更高的维度进行聚类。根据包的website,它是对Arai代码的改进:

  
      
  • 由于缺失而导致观察的透明处理下降
  •   
  • 完全多路(或n路,或n维或多维)聚类
  •   

使用Petersen数据和cluster.vcov()

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

答案 2 :(得分:5)

Arai的功能可用于聚类标准错误。他有另一个版本用于多维聚类:

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          cluster12 = paste(cluster1,cluster2, sep="")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}

有关参考和用法示例,请参阅:

答案 3 :(得分:4)

Frank Harrell的软件包rms(以前名为Design)具有我在群集时经常使用的功能:robcov

例如,请参阅?robcov的这一部分。

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.