包含NA的数据的群集标准错误

时间:2014-04-26 16:50:47

标签: r regression standard-error

我无法使用基于此post的R和指南来聚类标准错误。 cl函数返回错误:

Error in tapply(x, cluster1, sum) : arguments must have same length

在阅读tapply之后,我仍然不确定为什么我的群集参数长度错误,以及导致此错误的原因。

这是我正在使用的数据集的链接。

https://www.dropbox.com/s/y2od7um9pp4vn0s/Ec%201820%20-%20DD%20Data%20with%20Controls.csv

这是R代码:

# read in data
charter<-read.csv(file.choose())
View(charter)
colnames(charter)

# standardize NAEP scores
charter$naep.standardized <- (charter$naep - mean(charter$naep, na.rm=T))/sd(charter$naep, na.rm=T)

# change NAs in year.passed column to 2014
charter$year.passed[is.na(charter$year.passed)]<-2014

# Add column with indicator for in treatment (passed legislation)
charter$treatment<-ifelse(charter$year.passed<=charter$year,1,0)

# fit model
charter.model<-lm(naep ~ factor(year) + factor(state) + treatment, data = charter)
summary(charter.model)
# account for clustered standard errors by state
cl(dat=charter, fm=charter.model, cluster=charter$state)

# accounting for controls
charter.model.controls<-lm(naep~factor)

# clustered standard errors
# ---------

# function that calculates clustered standard errors
# source: http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/
cl   <- function(dat, fm, cluster){
  require(sandwich, quietly = TRUE)
  require(lmtest, quietly = TRUE)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- fm$rank
  dfc <- (M/(M-1))*((N-1)/(N-K))
  print(K)
  uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
  vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
  coeftest(fm, vcovCL) 
}

# calculate clustered standard errors 
cl(charter, charter.model, charter$state)

这个功能的内部运作有点过头了。

2 个答案:

答案 0 :(得分:5)

执行代码时,请注意线性模型中缺少观察值:

> summary(charter.model)

Call:
lm(formula = naep ~ factor(year) + factor(state) + treatment, 
    data = charter)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2420  -1.6740  -0.2024   1.8345  12.3580 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 250.4983     1.2115 206.767  < 2e-16 ***
factor(year)1992              3.7970     0.7198   5.275 2.17e-07 ***
factor(year)1996              7.0436     0.8607   8.183 3.64e-15 ***

[..]

Residual standard error: 3.128 on 404 degrees of freedom
  (759 observations deleted due to missingness)
Multiple R-squared:  0.9337,    Adjusted R-squared:  0.9239 
F-statistic: 94.85 on 60 and 404 DF,  p-value: < 2.2e-16

这是导致您看到的Error in tapply(x, cluster1, sum) : arguments must have same length错误消息的原因。

cl(dat=charter, fm=charter.model, cluster=charter$state)中,群集变量charter$state应该具有与回归估计中有效使用的观察数量完全相同的长度(由于NA与行中的行数不同)原始数据框)。


要解决此问题,您可以执行以下操作。

  1. 首先,您使用的是Arai函数的旧版本(cl)(请参阅Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R以获取对旧版本或新版本的引用,后者称为{{1 }})。

  2. 其次我认为Arai对此功能的原始方法有点复杂,并不真正遵循clx vcov*函数的标准接口。这就是为什么我带来了sandwich的略微修改版本。我使代码更具可读性,界面更像您对clx sandwich函数的期望:

    vcov*
  3. 如果您对数据尝试此功能,您会发现它捕获了这个特定问题:

    vcovCL <- function(x, cluster.by, type="sss", dfcw=1){
        # R-codes (www.r-project.org) for computing
        # clustered-standard errors. Mahmood Arai, Jan 26, 2008.
    
        # The arguments of the function are:
        # fitted model, cluster1 and cluster2
        # You need to install libraries `sandwich' and `lmtest'
    
        # reweighting the var-cov matrix for the within model
        require(sandwich)
        cluster <- cluster.by
        M <- length(unique(cluster))   
        N <- length(cluster)
        stopifnot(N == length(x$residuals))
        K <- x$rank
        ##only Stata small-sample correction supported right now 
        ##see plm >= 1.5-4
        stopifnot(type=="sss")  
        if(type=="sss"){
            dfc <- (M/(M-1))*((N-1)/(N-K))
        }
        uj  <- apply(estfun(x), 2, function(y) tapply(y, cluster, sum))
        mycov <- dfc * sandwich(x, meat=crossprod(uj)/N) * dfcw
        return(mycov)
    }
    

    要避免此问题,您可以按以下步骤操作:

    > coeftest(charter.model, vcov=function(x) vcovCL(x, charter$state))
     Error: N == length(x$residuals) is not TRUE
    

    这不好,但完成工作。现在> charter.x <- na.omit(charter[ , c("state", all.vars(formula(charter.model)))]) > coeftest(charter.model, vcov=function(x) vcovCL(x, charter.x$state)) t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.5050e+02 9.3781e-01 2.6711e+02 < 2.2e-16 *** factor(year)1992 3.7970e+00 5.6019e-01 6.7780e+00 4.330e-11 *** factor(year)1996 7.0436e+00 8.8574e-01 7.9522e+00 1.856e-14 *** factor(year)2000 8.4313e+00 1.0906e+00 7.7311e+00 8.560e-14 *** factor(year)2003 1.2392e+01 1.1670e+00 1.0619e+01 < 2.2e-16 *** factor(year)2005 1.3490e+01 1.1747e+00 1.1484e+01 < 2.2e-16 *** factor(year)2007 1.6334e+01 1.2469e+00 1.3100e+01 < 2.2e-16 *** factor(year)2009 1.8118e+01 1.2556e+00 1.4430e+01 < 2.2e-16 *** factor(year)2011 1.9110e+01 1.3459e+00 1.4199e+01 < 2.2e-16 *** factor(year)2013 1.9301e+01 1.4896e+00 1.2957e+01 < 2.2e-16 *** factor(state)Alaska 1.4178e+01 8.7686e-01 1.6169e+01 < 2.2e-16 *** factor(state)Arizona 8.6313e+00 8.1439e-01 1.0598e+01 < 2.2e-16 *** factor(state)Arkansas 4.3313e+00 8.1439e-01 5.3185e+00 1.736e-07 *** factor(state)California 3.1103e+00 9.1619e-01 3.3948e+00 0.0007549 *** factor(state)Colorado 1.7939e+01 7.9736e-01 2.2498e+01 < 2.2e-16 *** factor(state)Connecticut 1.8031e+01 8.1439e-01 2.2141e+01 < 2.2e-16 *** factor(state)D.C. -1.8369e+01 8.1439e-01 -2.2555e+01 < 2.2e-16 *** factor(state)Delaware 1.2050e+01 7.9736e-01 1.5113e+01 < 2.2e-16 *** factor(state)Florida 7.3838e+00 7.9736e-01 9.2602e+00 < 2.2e-16 *** factor(state)Georgia 6.4313e+00 8.1439e-01 7.8971e+00 2.724e-14 *** factor(state)Hawaii 3.3313e+00 8.1439e-01 4.0906e+00 5.196e-05 *** factor(state)Idaho 1.7118e+01 7.8321e-01 2.1857e+01 < 2.2e-16 *** factor(state)Illinois 1.2670e+01 8.2224e-01 1.5409e+01 < 2.2e-16 *** factor(state)Indianna 1.7174e+01 6.1079e-01 2.8117e+01 < 2.2e-16 *** factor(state)Iowa 2.0074e+01 6.8460e-01 2.9322e+01 < 2.2e-16 *** factor(state)Kansas 2.0123e+01 8.6796e-01 2.3184e+01 < 2.2e-16 *** factor(state)Kentucky 1.0200e+01 4.1999e-14 2.4287e+14 < 2.2e-16 *** factor(state)Louisiana -1.6866e-01 8.1439e-01 -2.0710e-01 0.8360322 factor(state)Maine 2.0231e+01 1.7564e-01 1.1518e+02 < 2.2e-16 *** factor(state)Maryland 1.4274e+01 6.1079e-01 2.3369e+01 < 2.2e-16 *** factor(state)Massachusetts 2.4868e+01 8.3960e-01 2.9619e+01 < 2.2e-16 *** factor(state)Michigan 1.2031e+01 8.1439e-01 1.4773e+01 < 2.2e-16 *** factor(state)Minnesota 2.5110e+01 9.1619e-01 2.7407e+01 < 2.2e-16 *** factor(state)Mississippi -3.5470e+00 1.7564e-01 -2.0195e+01 < 2.2e-16 *** factor(state)Missouri 1.3447e+01 7.2706e-01 1.8495e+01 < 2.2e-16 *** factor(state)Montana 2.2512e+01 8.4814e-01 2.6543e+01 < 2.2e-16 *** factor(state)Nebraska 1.9600e+01 4.3105e-14 4.5471e+14 < 2.2e-16 *** factor(state)Nevada 4.9800e+00 8.6796e-01 5.7375e+00 1.887e-08 *** factor(state)New Hampshire 2.2026e+01 7.6338e-01 2.8853e+01 < 2.2e-16 *** factor(state)New Jersey 2.0651e+01 7.6338e-01 2.7052e+01 < 2.2e-16 *** factor(state)New Mexico 1.5313e+00 8.1439e-01 1.8803e+00 0.0607809 . factor(state)New York 1.2152e+01 7.1259e-01 1.7054e+01 < 2.2e-16 *** factor(state)North Carolina 1.2231e+01 8.1439e-01 1.5019e+01 < 2.2e-16 *** factor(state)North Dakota 2.4278e+01 1.0420e-01 2.3299e+02 < 2.2e-16 *** factor(state)Ohio 1.7118e+01 7.8321e-01 2.1857e+01 < 2.2e-16 *** factor(state)Oklahoma 8.4518e+00 7.8321e-01 1.0791e+01 < 2.2e-16 *** factor(state)Oregon 1.6535e+01 7.3538e-01 2.2486e+01 < 2.2e-16 *** factor(state)Pennsylvania 1.6651e+01 7.6338e-01 2.1812e+01 < 2.2e-16 *** factor(state)Rhode Island 9.5313e+00 8.1439e-01 1.1704e+01 < 2.2e-16 *** factor(state)South Carolina 9.5346e+00 8.3960e-01 1.1356e+01 < 2.2e-16 *** factor(state)South Dakota 2.1211e+01 3.5103e-01 6.0425e+01 < 2.2e-16 *** factor(state)Tennessee 4.9148e+00 6.1473e-01 7.9951e+00 1.375e-14 *** factor(state)Texas 1.4231e+01 8.1439e-01 1.7475e+01 < 2.2e-16 *** factor(state)Utah 1.5114e+01 7.2706e-01 2.0787e+01 < 2.2e-16 *** factor(state)Vermont 2.3474e+01 2.0299e-01 1.1564e+02 < 2.2e-16 *** factor(state)Virginia 1.6252e+01 7.1259e-01 2.2807e+01 < 2.2e-16 *** factor(state)Washington 1.9073e+01 1.8183e-01 1.0489e+02 < 2.2e-16 *** factor(state)West Virginia 5.0000e+00 4.2022e-14 1.1899e+14 < 2.2e-16 *** factor(state)Wisconsin 1.9994e+01 8.2447e-01 2.4251e+01 < 2.2e-16 *** factor(state)Wyoming 1.8231e+01 8.1439e-01 2.2386e+01 < 2.2e-16 *** treatment 1.2108e+00 1.0180e+00 1.1894e+00 0.2349682 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 也可以正常工作并产生与上面相同的结果:

    cl

    更好的方法是使用 cl(dat=charter, fm=charter.model, cluster=charter.x$state) 包。根据包的website,它是对Arai代码的改进:

      

    由于缺失而导致观察的透明处理下降

    将Petersen数据与模拟的NA和multiwayvcov

    一起使用
    cluster.vcov()

    对于使用library("lmtest") library("multiwayvcov") data(petersen) set.seed(123) petersen[ sample(1:5000, 15), 3] <- NA m1 <- lm(y ~ x, data = petersen) summary(m1) ## ## Call: ## lm(formula = y ~ x, data = petersen) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.759 -1.371 -0.018 1.340 8.680 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.02793 0.02842 0.983 0.326 ## x 1.03635 0.02865 36.175 <2e-16 *** ## --- ## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ## ## Residual standard error: 2.007 on 4983 degrees of freedom ## (15 observations deleted due to missingness) ## Multiple R-squared: 0.208, Adjusted R-squared: 0.2078 ## F-statistic: 1309 on 1 and 4983 DF, p-value: < 2.2e-16 coeftest(m1, vcov=function(x) cluster.vcov(x, petersen$firmid)) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.027932 0.067198 0.4157 0.6777 ## x 1.036354 0.050700 20.4407 <2e-16 *** ## --- ## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 包的其他方法,请参阅:

答案 1 :(得分:-1)

对于单向群集,来自robcov包的{rms}命令工作得非常好。阅读本文以获取更多信息 http://www.inside-r.org/packages/cran/rms/docs/robcov