Question

I am trying to replicate Stata output in R. I am using the affairs dataset. I cannot replicate the probit estimation with robust standard errors.

The Stata code is as follows:

probit affair male age yrsmarr kids relig educ ratemarr, r

I started with:

 probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, 
           family = binomial (link = "probit"), data = mydata)

Then I tried various adjustments with the sandwich package, for example:

myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) {
            print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE)))
}

Or (with all types, HC0 through HC5):

myProbit <- function(probit1, vcov = sandwich) {
            print(coeftest(probit1, vcovHC(probit1, type = "HC0"))  
}

Or this, as suggested here (do I have to supply something different for object?):

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

None of these attempts produced the same standard errors or z-values as the Stata output.

Hoping for some constructive ideas!

Thanks in advance!

Answer 1

For anyone thinking of jumping on this bandwagon, here is some code that demonstrates the problem (data here):

clear
set more off
capture ssc install bcuse
capture ssc install rsource
bcuse affairs

saveold affairs, version(12) replace

rsource, terminator(XXX)
  library("foreign")
  library("lmtest")
  library("sandwich")
  mydata<-read.dta("affairs.dta")
  probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata)
  sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1)
  coeftest(probit1,vcov = sandwich1)
XXX 

probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog

R gives:

z test of coefficients:

             Estimate Std. Error z value  Pr(>|z|)    
(Intercept)  0.764157   0.546692  1.3978 0.1621780    
male         0.188816   0.133260  1.4169 0.1565119    
age         -0.024400   0.011423 -2.1361 0.0326725 *  
yrsmarr      0.054608   0.019025  2.8703 0.0041014 ** 
kids         0.208072   0.168222  1.2369 0.2161261    
relig       -0.186085   0.053968 -3.4480 0.0005647 ***
educ         0.015506   0.026389  0.5876 0.5568012    
ratemarr    -0.272711   0.053668 -5.0814 3.746e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stata yields:

Probit regression                               Number of obs     =        601
                                                Wald chi2(7)      =      54.93
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -305.2525               Pseudo R2         =     0.0961

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.131927     1.43   0.152    -0.069755    0.447390
         age |  -0.024400   0.011124    -2.19   0.028    -0.046202   -0.002597
     yrsmarr |   0.054608   0.018963     2.88   0.004     0.017441    0.091775
        kids |   0.208075   0.166243     1.25   0.211    -0.117754    0.533905
       relig |  -0.186085   0.053240    -3.50   0.000    -0.290435   -0.081736
        educ |   0.015505   0.026355     0.59   0.556    -0.036150    0.067161
    ratemarr |  -0.272710   0.053392    -5.11   0.000    -0.377356   -0.168064
       _cons |   0.764160   0.534335     1.43   0.153    -0.283117    1.811437
------------------------------------------------------------------------------

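Addendum:

The difference in the estimated covariance of the coefficients is due to the different fitting algorithms. In R, the glm command uses iteratively reweighted least squares, whereas Stata's probit uses an ML method based on the Newton-Raphson algorithm. You can match what R is doing by using Stata's glm command with the irls option: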

glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust

This yields:

Generalized linear models                         No. of obs      =        601
Optimization     : MQL Fisher scoring             Residual df     =        593
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  610.5049916                   (1/df) Deviance =   1.029519
Pearson          =  619.0405832                   (1/df) Pearson  =   1.043913

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = invnorm(u)              [Probit]

                                                  BIC             =  -3183.862

------------------------------------------------------------------------------
             |             Semirobust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.133260     1.42   0.157    -0.072367    0.450002
         age |  -0.024400   0.011422    -2.14   0.033    -0.046787   -0.002012
     yrsmarr |   0.054608   0.019025     2.87   0.004     0.017319    0.091897
        kids |   0.208075   0.168222     1.24   0.216    -0.121634    0.537785
       relig |  -0.186085   0.053968    -3.45   0.001    -0.291862   -0.080309
        educ |   0.015505   0.026389     0.59   0.557    -0.036216    0.067226
    ratemarr |  -0.272710   0.053668    -5.08   0.000    -0.377898   -0.167522
       _cons |   0.764160   0.546693     1.40   0.162    -0.307338    1.835657
------------------------------------------------------------------------------

These are close to the R results, but not identical. I don't know how to get R to use something like NR without a lot of work.
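One way to probe where the gap comes from, as a sketch on simulated data rather than the affairs dataset: the "(IRLS EIM)" line in the Stata glm output points to the expected information matrix, and for a probit (a non-canonical link) the observed Hessian generally differs from it in finite samples. The code below builds the sandwich both ways in base R. The simulated data and the names lam, bread_oim and bread_eim are illustrative assumptions, and the idea that Stata's probit with robust corresponds to the observed-Hessian version is an assumption, not something verified here.

```r
# Simulated data (hypothetical; not the affairs dataset)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y  <- rbinom(n, 1, pnorm(-0.3 + 0.5 * x1 - 0.4 * x2))

# Tight tolerance so the fit sits very close to the ML solution
fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"),
           control = glm.control(epsilon = 1e-12, maxit = 100))

X   <- model.matrix(fit)
eta <- drop(X %*% coef(fit))               # linear predictor

# Per-observation probit score with respect to eta
lam <- ifelse(y == 1,  dnorm(eta) / pnorm(eta),
                      -dnorm(eta) / pnorm(-eta))

# Meat: sum of outer products of the per-observation score vectors
meat <- crossprod(X * lam)

# Bread from the observed Hessian: -d2l/deta2 = lam * (lam + eta)
bread_oim <- solve(crossprod(X, X * (lam * (lam + eta))))

# Bread from the expected information, which is what vcov() returns for glm
bread_eim <- vcov(fit)

adj    <- n / (n - 1)                      # small-sample factor
se_oim <- sqrt(diag(bread_oim %*% meat %*% bread_oim) * adj)
se_eim <- sqrt(diag(bread_eim %*% meat %*% bread_eim) * adj)

print(cbind(se_oim, se_eim))
```

If the two columns differ slightly in the same way the Stata and R outputs above do, that would suggest the choice of bread matrix, rather than the optimizer itself, drives the discrepancy.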

Answer 2

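I am using the matrix approach described in detail here (p. 57) to match the R results with Stata. However, I have not yet been able to match the results exactly. I think the difference may be due to differing scores: the scores in R match Stata's only to 4 decimal places.

Stata: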

clear all
bcuse affairs

probit affair male age yrsmarr kids relig educ ratemarr
mat var_nr=e(V)
predict double u, score
matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600
matrix rv = var_nr*s*var_nr
mat diagrv=vecdiag(rv)
matmap diagrv rse,m(sqrt(@)) //install matmap 
mat list rse //standard errors

This gives you the same standard errors as:

qui probit affair male age yrsmarr kids relig educ ratemarr,r



rse[1,8]
       affair:    affair:    affair:    affair:    affair:    affair:    affair:    affair:
         male        age    yrsmarr       kids      relig       educ   ratemarr      _cons
r1  .13192707  .01112372  .01896336  .16624258  .05324046  .02635524  .05339163  .53433495

R：

library(AER) # Affairs data
data(Affairs)
mydata<-Affairs
mydata$affairs<-with(mydata,ifelse(affairs>0,1,affairs)) # convert to 1 and 0 
probit1<-glm(affairs ~ gender+ age + yearsmarried + children + religiousness+education + rating,family = binomial(link = "probit"),data = mydata)
u<-subset(estfun(probit1),select="(Intercept)") #scores: perfectly matches to 4 decimals with Stata: difference may be due to this step
w0<-u%*%t(u)*(601/600) #(n/n-1)
iweight<-matrix(0,nrow=601,ncol=601) #perfectly matches to 4 decimals with Stata 
diag(iweight)<-diag(w0) 
x<-model.matrix(probit1)  
s<-t(x)%*%iweight%*%x #doesn't match with Stata : 
rv<-vcov(probit1)%*%s%*%vcov(probit1)
rse<-sqrt(diag(rv)) # standard  errors
   rse
  (Intercept)    gendermale           age  yearsmarried   childrenyes religiousness     education        rating 
   0.54669177    0.13325951    0.01142258    0.01902537    0.16822161    0.05396841    0.02638902    0.05366828

This matches:

 sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

Conclusion: the difference between the R and Stata results is due to differences in the scores (which match only to 4 decimal places).
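As a side note on the computation, the middle matrix s does not need an explicit n x n weight matrix. The sketch below uses made-up inputs (X and u are hypothetical stand-ins for the model matrix and the per-observation scores) to show that scaling the rows of X by u and taking the cross-product gives the same result as t(x) %*% iweight %*% x.

```r
set.seed(2)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))   # hypothetical design matrix
u <- rnorm(n)                       # hypothetical per-observation scores

# Explicit n x n diagonal weight matrix, as in the R code above
W      <- diag(u^2 * n / (n - 1))
s_full <- t(X) %*% W %*% X

# Equivalent: scale each row of X by u, then take the cross-product
s_fast <- crossprod(X * u) * n / (n - 1)

print(all.equal(s_full, s_fast))
```

This avoids allocating the 601 x 601 matrix and mirrors what matrix accum does with iweight in the Stata code.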

Answer 3

To add to this discussion: you can match the original Stata output in R by using sampleSelection::probit for the estimation and the sandwich package (I use version 2.5) to compute the robust standard errors. The probit function and its Stata counterpart both use maximum likelihood.

As in the original post, the Stata code is

probit affair male age yrsmarr kids relig educ ratemarr, robust

which gives

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .1888175   .1319271     1.43   0.152    -.0697548    .4473898
         age |  -.0243996   .0111237    -2.19   0.028    -.0462017   -.0025975
     yrsmarr |    .054608   .0189634     2.88   0.004     .0174405    .0917755
        kids |   .2080754   .1662426     1.25   0.211     -.117754    .5339049
       relig |  -.1860854   .0532405    -3.50   0.000    -.2904348    -.081736
        educ |   .0155052   .0263552     0.59   0.556    -.0361501    .0671605
    ratemarr |  -.2727101   .0533916    -5.11   0.000    -.3773558   -.1680644
       _cons |     .76416    .534335     1.43   0.153    -.2831173    1.811437
------------------------------------------------------------------------------

The R code that gives the same result is

library(AER)
library(sampleSelection)
data(Affairs)
Affairs$affair = Affairs$affairs > 0
Affairs$male = Affairs$gender == 'male'
reg = probit(affair ~ male + age + yearsmarried + children + religiousness +
           education + rating, data=Affairs)
print(coeftest(reg, vcovCL), digits=6)

This gives

               Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)    0.7641600  0.5343350  1.43011  0.1532109    
maleTRUE       0.1888175  0.1319271  1.43123  0.1528921    
age           -0.0243996  0.0111237 -2.19347  0.0286608 *  
yearsmarried   0.0546080  0.0189634  2.87966  0.0041248 ** 
childrenyes    0.2080755  0.1662426  1.25164  0.2111955    
religiousness -0.1860854  0.0532405 -3.49519  0.0005091 ***
education      0.0155052  0.0263552  0.58832  0.5565446    
rating        -0.2727101  0.0533916 -5.10773 4.4012e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

With these functions, both programs compute maximum likelihood probit estimates and both compute robust standard errors. By the way: hats off to the authors of the sandwich package, which (IMO) has really cleaned up standard error computation in R.

Replicating Stata probit with robust errors in R

3 answers: