Stepwise regression in R with model constraints

Time: 2019-05-06 14:11:16

Tags: r regression linear-regression

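I estimated a gravity model of some international flows, using data on the origin country (x*_o), the destination country (x*_d), and a set of distance variables (x*). Now I would like to see whether stepwise model selection can find a more parsimonious model. My data look like this: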

set.seed(450)
data <- data.frame(dep = rnorm(20, 6, 2),
                   x1_o = rnorm(20, 0, 1),
                   x1_d = rnorm(20, 5, 3),
                   x2_o = rnorm(20, 5, 3),
                   x2_d = rnorm(20, 5, 3),
                   x3_o = rnorm(20, 5, 3),
                   x3_d = rnorm(20, 5, 3),
                   x4 = rnorm(20, 5, 3),
                   x5 = rnorm(20, 5, 3),
                   x6 = rnorm(20, 5, 3))

Fitting the linear model and running the stepwise regression:

lm_fit <- lm(dep ~ ., data = data)
step_fit <- step(lm_fit, direction = "both")
summary(step_fit)

Result:

Call:
lm(formula = dep ~ x1_d + x2_d + x3_o + x3_d + x4 + x6, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-1.962 -1.003  0.213  0.550  1.955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4525     1.5384   6.144 3.52e-05 ***
x1_d         -0.1615     0.1141  -1.416  0.18039    
x2_d         -0.8532     0.2105  -4.053  0.00137 ** 
x3_o         -0.1334     0.1011  -1.320  0.20969    
x3_d          0.2332     0.1319   1.768  0.10055    
x4            0.2830     0.1304   2.170  0.04914 *  
x6           -0.1729     0.1123  -1.539  0.14776    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.411 on 13 degrees of freedom
Multiple R-squared:  0.595, Adjusted R-squared:  0.4081 
F-statistic: 3.183 on 6 and 13 DF,  p-value: 0.0379

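As you can see, step removed the x1 and x2 variables for the origin country but kept them for the destination country. What I would like is for step to always keep or remove a variable for the origin and the destination country together. For example, x1_o and x1_d should either both be in the model or both be out.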

Is this possible in R? The scope argument offers a way to impose some constraints on model selection, but I am not sure whether it can do what I want.

Thanks.

1 answer:

Answer 0: (score: 1)

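Define each pair of columns as a single factor with nrow(data) levels whose contrast matrix has 2 columns, the origin column and the destination column. Because step treats such a factor as one term, it is forced to either keep both columns or reject both. Using the data in the Note at the end (the same as the data in the question, except that the random seed has been changed so that the result is a mix of factors and remaining columns):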

nr <- nrow(data)
data2 <- transform(data, 
  x1 = C(factor(1:nr), cbind(x1_o, x1_d), 2),
  x2 = C(factor(1:nr), cbind(x2_o, x2_d), 2),
  x3 = C(factor(1:nr), cbind(x3_o, x3_d), 2))

fm <- lm(dep ~ x1 + x2 + x3 + x4 + x5 + x6, data2)
dim(model.matrix(fm)) # check dimensions of model matrix
step(fm)
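
As a quick sanity check of why this works, here is a minimal sketch on a reduced, made-up data set (one origin/destination pair plus one distance variable; the names fm_grouped and fm_plain are only illustrative). It assumes, as in the code above, that giving C() a two-column contrast matrix with how.many = 2 makes the factor contribute exactly those two columns to the model matrix, so the grouped fit should match the plain fit while drop1 (which step relies on) sees the pair as a single 2-df term:

```r
# Reduced illustrative data: one origin/destination pair and one distance variable
set.seed(13)
d <- data.frame(dep  = rnorm(20, 6, 2),
                x1_o = rnorm(20, 0, 1),
                x1_d = rnorm(20, 5, 3),
                x4   = rnorm(20, 5, 3))
nr <- nrow(d)

# One level per row; the "contrasts" are the two original columns
d2 <- transform(d, x1 = C(factor(1:nr), cbind(x1_o, x1_d), 2))

fm_grouped <- lm(dep ~ x1 + x4, data = d2)        # pair entered as one term
fm_plain   <- lm(dep ~ x1_o + x1_d + x4, data = d) # pair entered as two terms

# Both models span the same columns, so the fitted values should agree
all.equal(unname(fitted(fm_grouped)), unname(fitted(fm_plain)))

# Intercept + 2 columns for x1 + 1 column for x4
dim(model.matrix(fm_grouped))

# x1 is a single candidate term with 2 degrees of freedom
drop1(fm_grouped)
```

Since x1_o and x1_d no longer appear as separate terms, step can only add or drop them jointly.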

Note

set.seed(13)
data <- data.frame(dep = rnorm(20, 6, 2),
                   x1_o = rnorm(20, 0, 1),
                   x1_d = rnorm(20, 5, 3),
                   x2_o = rnorm(20, 5, 3),
                   x2_d = rnorm(20, 5, 3),
                   x3_o = rnorm(20, 5, 3),
                   x3_d = rnorm(20, 5, 3),
                   x4 = rnorm(20, 5, 3),
                   x5 = rnorm(20, 5, 3),
                   x6 = rnorm(20, 5, 3))