Question

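I am trying to simulate the sensitivity of matching versus regression (OLS), but I am doing something wrong somewhere, because I cannot retrieve the true model using matching.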

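I am generating 3 variables: x, a background characteristic; d, the treatment variable (binary); and y, the outcome. d is associated with x. The idea of matching is that, once you condition on x, the treatment assignment process is as good as random. In the regression world, x is just a control variable. I want to test how regression performs when the data contain a region of non-common support (no treated units above or below certain values of x).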

library(tidyverse)
library(Matching)
library(MatchIt)

N = 1000
# generate a normally distributed random variable #
x = rnorm(N, 0, 5)

This is how I generate the association between x and d (binary).

# generate treatment associated with x, with different probabilities on either side of a threshold #
d = ifelse(x > 0.7, rbinom(0.7 * N, 1, 0.6) , rbinom( (1 - 0.7) * N, 1, 0.3) )
# for x above 0.7 the probability of receiving treatment is 0.6; otherwise it is 0.3 #

It seems correct to me, but if you have a better way, please let me know.
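One possible alternative, offered as a sketch rather than a definitive fix: draw one Bernoulli per unit with a unit-specific probability, so the number of draws always equals the sample size and nothing relies on vector recycling inside ifelse. The seed and the names N_alt, x_alt, p_alt and d_alt are assumptions introduced only for this illustration, so they do not overwrite the variables used elsewhere.

```r
set.seed(1)                                # assumption: seed added only for reproducibility
N_alt <- 1000
x_alt <- rnorm(N_alt, 0, 5)

# per-unit treatment probability: 0.6 when x is above 0.7, 0.3 otherwise
p_alt <- ifelse(x_alt > 0.7, 0.6, 0.3)     # hypothetical helper name

# one Bernoulli draw per unit; rbinom is vectorized over prob
d_alt <- rbinom(N_alt, 1, p_alt)

# empirical treatment rates on each side of the threshold
tapply(d_alt, x_alt > 0.7, mean)
```

The two rates printed at the end should land near 0.3 and 0.6, up to sampling noise.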

# adding a bit more randomness #
d[ sample(length(d), 100) ] <- rbinom(100, 1, 0.5)

# also adding a cut-off point for the treated #  
d[x < -10] <- 0
d[x > 10] <- 0

I am generating the effect of d using ifelse, which seems correct to me, but I might be wrong.

# generate outcome y, with a polynomial relationship with x and a treatment effect of 15; noise sd = 10 #
y = x*1 + x^2 + rnorm(N, ifelse(d == 1, 15, 0), 10)

#
df = cbind(x,d,y) %>% as.data.frame()
# check out the "common support"
df %>% ggplot(aes(x, y, colour = factor(d) )) + geom_point()
#

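The plot shows how I want to model the relationship between the 3 variables. Note the cut-off for the treated: there are no treated units above 10 (or below -10).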
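The common-support region can also be checked numerically instead of by eye. The sketch below is only an illustration on freshly simulated data with the same cut-off idea; the seed and the names x_cs, d_cs and support are assumptions, not part of the original simulation.

```r
set.seed(2)                                # assumption: arbitrary seed
x_cs <- rnorm(1000, 0, 5)
d_cs <- rbinom(1000, 1, ifelse(x_cs > 0.7, 0.6, 0.3))
d_cs[abs(x_cs) > 10] <- 0                  # no treated units beyond |x| > 10

# min and max of x within each treatment group (list elements "0" and "1")
support <- tapply(x_cs, d_cs, range)
support
```

Control units whose x falls outside the treated range have no comparable treated counterpart, which is the region where regression has to extrapolate.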

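Now, when I estimate the effect of d on y with OLS, the omitted-variable model and the misspecified model give me incorrect estimates for d, as expected.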

# omitted x #
lm(y ~ d, df) %>% summary()
# misspecification #
lm(y ~ d + x, df) %>% summary()
# true model #

The correct specification, on the other hand, gives me 15 (the true effect of d).

lm(y ~ d + poly(x,2), df) %>% summary()
# we correctly retrieve 15 #

Now, my problem is understanding why I cannot get to 15 (the true effect of d) with the matching packages.

Using the MatchIt package.

I tried using the Mahalanobis distance and the propensity score, like this:

m1 = matchit(d ~ x, df, distance = 'mahalanobis', method = 'genetic')
m2a = matchit(d ~ x, df, distance = 'logit', method = 'genetic')
m2b = matchit(d ~ x + I(x^2), df, distance = 'logit', method = 'genetic')

The matched data:

mat1 = match.data(m1)
mat2a = match.data(m2a)
mat2b = match.data(m2b)

# OLS #
lm(y ~ d, mat1) %>% summary()
lm(y ~ d, mat2a) %>% summary()
lm(y ~ d, mat2b) %>% summary()

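So here I do not retrieve 15. Why? Am I misinterpreting the results? I was under the impression that, when doing matching, you do not have to model polynomial terms and/or interactions. Is that wrong?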

lm(y ~ d + poly(x,2), mat1) %>% summary()
lm(y ~ d + poly(x,2), mat2a) %>% summary()
lm(y ~ d + poly(x,2), mat2b) %>% summary()

Because if I include the poly(x,2) term here, I do retrieve 15.

Using the Matching package, I also get a completely different estimate.

x1 = df$x
gl = glm(d ~ x + I(x^2), df, family = binomial)
x1 = gl$fitted.values

# I thought that it could be because OLS only gives ATE #
m0 = Match(Y = y, Tr = d, X = x1, estimand = 'ATE')
# but no 
m0$est

Any clue?

Answer 1

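An important output of the matching procedure is the set of weights for the control observations. The weights are computed so that, once they are applied, the distribution of the propensity score is similar in the treated and control groups.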

In your case, this means (starting from your dgp and using your notation):

lm(y ~ d, mat1, weights = weights) %>% summary()
lm(y ~ d, mat2a, weights = weights) %>% summary()
lm(y ~ d, mat2b, weights = weights) %>% summary()

And there we are: 15 is back (14.9, actually)!
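To see what the weights argument does in these regressions, here is a minimal sketch, assuming stand-in weights rather than real matching output: with a binary regressor and an intercept, the weighted least-squares coefficient on d is the difference between the weighted outcome means of the two groups. The seed and the names d_w, y_w and w_w are hypothetical; in the real workflow the weights would come from the weights column of match.data().

```r
set.seed(42)                               # assumption: arbitrary seed
n_w <- 200
d_w <- rbinom(n_w, 1, 0.5)                 # binary treatment
y_w <- 15 * d_w + rnorm(n_w, 0, 10)        # outcome with a treatment effect of 15
w_w <- runif(n_w, 0.5, 2)                  # stand-in weights, NOT produced by matching

# weighted regression of y on d
fit_w <- lm(y_w ~ d_w, weights = w_w)

# weighted mean of treated outcomes minus weighted mean of control outcomes
wdiff <- weighted.mean(y_w[d_w == 1], w_w[d_w == 1]) -
         weighted.mean(y_w[d_w == 0], w_w[d_w == 0])

# the two quantities agree up to floating-point tolerance
all.equal(unname(coef(fit_w)["d_w"]), wdiff)
```

Dropping the weights amounts to comparing raw group means in the matched sample, which ignores how many times (or how strongly) each control unit was used.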
