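My question: what is the most efficient way to drop a predictor that contains `NA`s, so that the complete cases are determined without that predictor?

The problem arises from the following regression with `NA`s, where values are missing in `Ozone` (mostly) and `Solar.R`.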
data(airquality)
summary(airquality)
#      Ozone           Solar.R           Wind             Temp           Month
#  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
#  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
#  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
#  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
#  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
#  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
#  NA's   :37       NA's   :7
#       Day
#  Min.   : 1.0
#  1st Qu.: 8.0
#  Median :16.0
#  Mean   :15.8
#  3rd Qu.:23.0
#  Max.   :31.0
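To see how many rows each choice of variables would keep, the missing values can be counted directly. This is a small sketch using only base R; the names `n_all`, `no_ozone`, and `n_no_ozone` are introduced here for illustration.

```r
data(airquality)

# Number of missing values per column
colSums(is.na(airquality))

# Rows that are complete across all six columns
n_all <- sum(complete.cases(airquality))

# Rows that are complete once Ozone is left out
no_ozone <- airquality[, setdiff(names(airquality), "Ozone")]
n_no_ozone <- sum(complete.cases(no_ozone))

c(all_columns = n_all, without_ozone = n_no_ozone)
```

The gap between the two counts is the number of observations that are lost only because `Ozone` is missing.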
Regress `Wind` on the remaining variables. Only the complete cases are used.
summary(lm(Wind ~ ., data = airquality))
#
# Call:
# lm(formula = Wind ~ ., data = airquality)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -4.3908 -2.2800 -0.3078  1.4132  9.6501
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) 15.519460   2.918393   5.318 5.96e-07 ***
# Ozone       -0.060746   0.011798  -5.149 1.23e-06 ***
# Solar.R      0.003791   0.003216   1.179    0.241
# Temp        -0.036604   0.044576  -0.821    0.413
# Month       -0.159671   0.208082  -0.767    0.445
# Day          0.017353   0.031238   0.556    0.580
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.822 on 105 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.3994, Adjusted R-squared: 0.3708
# F-statistic: 13.96 on 5 and 105 DF, p-value: 1.857e-10
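If `Ozone` is removed from the formula, the fit is still restricted to the cases that are complete when `Ozone` is included. This is not the same as removing `Ozone` manually.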
summary(lm(Wind ~ . - Ozone, data = airquality))
#
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
#
# Residuals:
#    Min     1Q Median     3Q    Max
# -6.012 -2.323 -0.361  1.493  9.605
#
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) 24.3159074  2.6354288   9.227 3.09e-15 ***
# Solar.R      0.0009228  0.0035281   0.262    0.794
# Temp        -0.1900820  0.0369159  -5.149 1.21e-06 ***
# Month        0.0313046  0.2280600   0.137    0.891
# Day          0.0008969  0.0346116   0.026    0.979
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.143 on 106 degrees of freedom
# (42 observations deleted due to missingness)
# Multiple R-squared: 0.2477, Adjusted R-squared: 0.2193
# F-statistic: 8.727 on 4 and 106 DF, p-value: 3.961e-06
summary(lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality))
#
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Month + Day, data = airquality)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -8.1779 -2.2063 -0.2757  1.9448  9.3510
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept) 23.660271   2.416766   9.790  < 2e-16 ***
# Solar.R      0.002980   0.003113   0.957    0.340
# Temp        -0.186386   0.032725  -5.695 6.89e-08 ***
# Month        0.074952   0.206334   0.363    0.717
# Day         -0.011028   0.030304  -0.364    0.716
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.158 on 141 degrees of freedom
# (7 observations deleted due to missingness)
# Multiple R-squared: 0.2125, Adjusted R-squared: 0.1901
# F-statistic: 9.511 on 4 and 141 DF, p-value: 7.761e-07
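The discrepancy between the last two fits can be checked without refitting, by looking at the model frame that is built before incomplete rows are dropped. The following is a sketch, assuming that `model.frame` with the default `na.action` (`na.omit`) mirrors what `lm` does internally; `mf_minus` and `mf_manual` are names chosen here.

```r
data(airquality)

# Model frame for the formula that subtracts Ozone from the dot expansion
mf_minus <- model.frame(Wind ~ . - Ozone, data = airquality)

# Model frame for the formula that never mentions Ozone
mf_manual <- model.frame(Wind ~ Solar.R + Temp + Month + Day, data = airquality)

# Ozone is still a column of the first frame, so its NAs remove rows
names(mf_minus)
nrow(mf_minus)
nrow(mf_manual)
```

The row counts match the residual degrees of freedom plus the number of estimated coefficients in the two summaries above (106 + 5 and 141 + 5).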
Answer 0 (score: 3)
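It is indeed unfortunate and surprising that `Wind ~ . - Ozone` still takes `Ozone` into account when determining the complete cases. If you want to pursue it, this seems worth raising on the r-devel@r-project.org mailing list. In the meantime, how about: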
summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))
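Another way to get the same fit is to build the formula from the column names, so that `Ozone` never enters the model frame. This is only a sketch: `response`, `drop_vars`, `preds`, `fit_subset`, and `fit_formula` are illustrative names, while `reformulate` is the base R (stats) helper for assembling a formula from term labels.

```r
data(airquality)

response  <- "Wind"
drop_vars <- "Ozone"   # predictors to leave out entirely

# Approach from the answer: remove the column before fitting
keep <- setdiff(names(airquality), drop_vars)
fit_subset <- lm(Wind ~ ., data = airquality[, keep])

# Alternative: build the formula from the remaining column names
preds <- setdiff(names(airquality), c(response, drop_vars))
fit_formula <- lm(reformulate(preds, response = response), data = airquality)

# Both fits should use the same rows and give the same coefficients
nobs(fit_formula)
all.equal(coef(fit_subset), coef(fit_formula))
```

The formula-based version may be convenient when several predictors are dropped in a loop, since `drop_vars` can hold more than one name and the data frame itself is left untouched.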