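I have estimated a gravity model of some international flows using data on origin countries (x*_o), destination countries (x*_d), and a set of distance variables (x*). Now I would like to see whether stepwise model selection can find a more parsimonious model. My data looks like this: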
set.seed(450)
data <- data.frame(dep = rnorm(20, 6, 2),
x1_o = rnorm(20, 0, 1),
x1_d = rnorm(20, 5, 3),
x2_o = rnorm(20, 5, 3),
x2_d = rnorm(20, 5, 3),
x3_o = rnorm(20, 5, 3),
x3_d = rnorm(20, 5, 3),
x4 = rnorm(20, 5, 3),
x5 = rnorm(20, 5, 3),
x6 = rnorm(20, 5, 3))
Fit a linear model and run stepwise selection:
lm_fit <- lm(dep ~ ., data = data)
step_fit <- step(lm_fit, direction = "both")
summary(step_fit)
Result:
Call:
lm(formula = dep ~ x1_d + x2_d + x3_o + x3_d + x4 + x6, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.962 -1.003 0.213 0.550 1.955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4525 1.5384 6.144 3.52e-05 ***
x1_d -0.1615 0.1141 -1.416 0.18039
x2_d -0.8532 0.2105 -4.053 0.00137 **
x3_o -0.1334 0.1011 -1.320 0.20969
x3_d 0.2332 0.1319 1.768 0.10055
x4 0.2830 0.1304 2.170 0.04914 *
x6 -0.1729 0.1123 -1.539 0.14776
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.411 on 13 degrees of freedom
Multiple R-squared: 0.595, Adjusted R-squared: 0.4081
F-statistic: 3.183 on 6 and 13 DF, p-value: 0.0379
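As you can see, step dropped the origin-country variables x1_o and x2_o but kept their destination-country counterparts x1_d and x2_d. What I want is for step to always keep or drop the origin and destination variables together. For example, x1_o and x1_d should either both be in or both be out.
Is this possible in R? The scope argument offers some ways to constrain model selection, but I am not sure whether it can do what I want.
Thanks.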
Answer 0 (score: 1)
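Define each pair of columns as a single factor with nrow(data) levels whose contrast matrix consists of the two columns, origin and destination. Because step adds or drops whole terms, any such factor forces it to either keep both columns or drop both. The code below uses the data in the Note at the end (the same as the data in the question, except that the random seed has been changed so that the selected model is a mix of pair factors and remaining columns):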
nr <- nrow(data)
data2 <- transform(data,
x1 = C(factor(1:nr), cbind(x1_o, x1_d), 2),
x2 = C(factor(1:nr), cbind(x2_o, x2_d), 2),
x3 = C(factor(1:nr), cbind(x3_o, x3_d), 2))
fm <- lm(dep ~ x1 + x2 + x3 + x4 + x5 + x6, data2)
dim(model.matrix(fm)) # check dimensions of model matrix
step(fm)
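As a quick sanity check on the idea above (a sketch, not part of the original answer), the following compares the pair-factor model with the plain model. It assumes the same simulated data as in the Note. The helper `pair_factor` and the names `fm_pairs`, `fm_plain` and `sel` are made up for illustration.

```r
set.seed(13)
# Same simulated data as in the Note: three origin/destination pairs plus x4..x6.
data <- data.frame(dep = rnorm(20, 6, 2),
                   x1_o = rnorm(20, 0, 1), x1_d = rnorm(20, 5, 3),
                   x2_o = rnorm(20, 5, 3), x2_d = rnorm(20, 5, 3),
                   x3_o = rnorm(20, 5, 3), x3_d = rnorm(20, 5, 3),
                   x4 = rnorm(20, 5, 3), x5 = rnorm(20, 5, 3),
                   x6 = rnorm(20, 5, 3))

# pair_factor() is a hypothetical helper: one factor level per row, with the
# origin and destination columns supplied as the factor's two contrasts.
pair_factor <- function(o, d) C(factor(seq_along(o)), cbind(o, d), 2)

data2 <- data
data2$x1 <- pair_factor(data$x1_o, data$x1_d)
data2$x2 <- pair_factor(data$x2_o, data$x2_d)
data2$x3 <- pair_factor(data$x3_o, data$x3_d)

fm_pairs <- lm(dep ~ x1 + x2 + x3 + x4 + x5 + x6, data = data2)
fm_plain <- lm(dep ~ x1_o + x1_d + x2_o + x2_d + x3_o + x3_d + x4 + x5 + x6,
               data = data)

# Each pair factor should contribute exactly two columns:
# intercept + 3 pairs * 2 columns + x4, x5, x6.
print(dim(model.matrix(fm_pairs)))

# Both full models use the same regressors, so their fits should agree.
print(all.equal(fitted(fm_pairs), fitted(fm_plain)))

# step() adds or drops whole terms, so x1, x2 and x3 move as units.
sel <- step(fm_pairs, trace = 0)
print(attr(terms(sel), "term.labels"))
print(names(coef(sel)))
```

If a pair survives selection, both of its coefficients appear in `names(coef(sel))`; if it is dropped, neither does.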
Note:
set.seed(13)
data <- data.frame(dep = rnorm(20, 6, 2),
x1_o = rnorm(20, 0, 1),
x1_d = rnorm(20, 5, 3),
x2_o = rnorm(20, 5, 3),
x2_d = rnorm(20, 5, 3),
x3_o = rnorm(20, 5, 3),
x3_d = rnorm(20, 5, 3),
x4 = rnorm(20, 5, 3),
x5 = rnorm(20, 5, 3),
x6 = rnorm(20, 5, 3))
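A variation on the same idea, which is not from the original answer, is to store each origin/destination pair as a two-column matrix inside the data frame. lm treats a matrix-valued variable as one term, so step should again keep or drop both columns together. This is only a sketch under that assumption; the names `data3`, `fm_mat` and `sel_mat` are made up for illustration.

```r
set.seed(13)
# Same simulated data as in the Note.
data <- data.frame(dep = rnorm(20, 6, 2),
                   x1_o = rnorm(20, 0, 1), x1_d = rnorm(20, 5, 3),
                   x2_o = rnorm(20, 5, 3), x2_d = rnorm(20, 5, 3),
                   x3_o = rnorm(20, 5, 3), x3_d = rnorm(20, 5, 3),
                   x4 = rnorm(20, 5, 3), x5 = rnorm(20, 5, 3),
                   x6 = rnorm(20, 5, 3))

# Keep the unpaired columns, then add one matrix column per pair.
data3 <- data[c("dep", "x4", "x5", "x6")]
data3$x1 <- cbind(o = data$x1_o, d = data$x1_d)
data3$x2 <- cbind(o = data$x2_o, d = data$x2_d)
data3$x3 <- cbind(o = data$x3_o, d = data$x3_d)

# Each matrix column enters the formula as a single term with two coefficients.
fm_mat <- lm(dep ~ x1 + x2 + x3 + x4 + x5 + x6, data = data3)

# Stepwise selection then operates on the six terms, not the nine columns.
sel_mat <- step(fm_mat, trace = 0)
print(attr(terms(sel_mat), "term.labels"))
print(names(coef(sel_mat)))
```

Compared with the contrast-based factors, this avoids creating a factor with one level per row, at the cost of a less conventional data frame layout.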