Missing-data behavior in lm: complete cases are determined using all variables, even after predictors with missing data are excluded

Time: 2017-12-11 12:35:51

Tags: r na lm

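My question: what is the most efficient way to drop a predictor that contains NAs, so that the complete cases are determined without the dropped predictor?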

The question arises from the following regression with NAs, where values are missing in Ozone (mostly) and in Solar.R.

data(airquality)
summary(airquality)
#     Ozone           Solar.R           Wind             Temp           Month      
# Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000  
# 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000  
# Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000  
# Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993  
# 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000  
# Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000  
# NA's   :37       NA's   :7                                                       
#      Day      
# Min.   : 1.0  
# 1st Qu.: 8.0  
# Median :16.0  
# Mean   :15.8  
# 3rd Qu.:23.0  
# Max.   :31.0  

Regress Wind on the remaining variables. Only complete cases are considered.

summary(lm(Wind ~ ., data = airquality))
# 
# Call:
# lm(formula = Wind ~ ., data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -4.3908 -2.2800 -0.3078  1.4132  9.6501 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 15.519460   2.918393   5.318 5.96e-07 ***
# Ozone       -0.060746   0.011798  -5.149 1.23e-06 ***
# Solar.R      0.003791   0.003216   1.179    0.241    
# Temp        -0.036604   0.044576  -0.821    0.413    
# Month       -0.159671   0.208082  -0.767    0.445    
# Day          0.017353   0.031238   0.556    0.580    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 2.822 on 105 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.3994,  Adjusted R-squared:  0.3708 
# F-statistic: 13.96 on 5 and 105 DF,  p-value: 1.857e-10

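If Ozone is removed with `- Ozone`, the fit still uses only the cases that are complete with Ozone included. This is not the same as leaving Ozone out of the formula by hand.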

summary(lm(Wind ~ . - Ozone, data = airquality))
# 
# Call:
# lm(formula = Wind ~ . - Ozone, data = airquality)
# 
# Residuals:
#    Min     1Q Median     3Q    Max 
# -6.012 -2.323 -0.361  1.493  9.605 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 24.3159074  2.6354288   9.227 3.09e-15 ***
# Solar.R      0.0009228  0.0035281   0.262    0.794    
# Temp        -0.1900820  0.0369159  -5.149 1.21e-06 ***
# Month        0.0313046  0.2280600   0.137    0.891    
# Day          0.0008969  0.0346116   0.026    0.979    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.143 on 106 degrees of freedom
#   (42 observations deleted due to missingness)
# Multiple R-squared:  0.2477,  Adjusted R-squared:  0.2193 
# F-statistic: 8.727 on 4 and 106 DF,  p-value: 3.961e-06

summary(lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality))
# 
# Call:
# lm(formula = Wind ~ Solar.R + Temp + Month + Day, data = airquality)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -8.1779 -2.2063 -0.2757  1.9448  9.3510 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 23.660271   2.416766   9.790  < 2e-16 ***
# Solar.R      0.002980   0.003113   0.957    0.340    
# Temp        -0.186386   0.032725  -5.695 6.89e-08 ***
# Month        0.074952   0.206334   0.363    0.717    
# Day         -0.011028   0.030304  -0.364    0.716    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 3.158 on 141 degrees of freedom
#   (7 observations deleted due to missingness)
# Multiple R-squared:  0.2125,  Adjusted R-squared:  0.1901 
# F-statistic: 9.511 on 4 and 141 DF,  p-value: 7.761e-07
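
The difference between the last two fits comes down to how many rows survive NA removal. A minimal sketch of that check, assuming only the built-in airquality data and base R functions (the variable names are illustrative):

```r
data(airquality)

fit_minus  <- lm(Wind ~ . - Ozone, data = airquality)
fit_manual <- lm(Wind ~ Solar.R + Temp + Month + Day, data = airquality)

# Number of rows actually used by each fit
n_minus  <- nobs(fit_minus)
n_manual <- nobs(fit_manual)

# Complete cases over all columns vs. over all columns except Ozone
n_cc_all      <- sum(complete.cases(airquality))
n_cc_no_ozone <- sum(complete.cases(airquality[, names(airquality) != "Ozone"]))

print(c(n_minus = n_minus, n_cc_all = n_cc_all))
print(c(n_manual = n_manual, n_cc_no_ozone = n_cc_no_ozone))
```

Going by the degrees of freedom reported above, the first pair should both be 111 and the second pair both 146. That suggests rows are filtered on every variable that `.` expands to, before the `- Ozone` removal has any effect on the model terms.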

1 Answer:

Answer 0: (score: 3)

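It is indeed unfortunate and surprising that Wind ~ . - Ozone takes Ozone into account when looking for complete cases; if you want to pursue it, it seems worth raising on the r-devel@r-project.org mailing list. In the meantime, how about: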

 summary(lm(Wind ~ ., data = subset(airquality, select = -Ozone)))
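
If the same pattern is needed for several predictors, one possible alternative is to build the formula explicitly from the column names to keep, so the dropped column never enters the model frame. The sketch below is only an illustration: `lm_without` is a hypothetical helper, not part of base R, and it relies on `reformulate()` from the stats package.

```r
data(airquality)

# Hypothetical helper (not part of base R): fit `response` on every column
# of `data` except those listed in `drop`, using an explicit formula.
lm_without <- function(data, response, drop) {
  keep <- setdiff(names(data), c(response, drop))
  lm(reformulate(keep, response = response), data = data)
}

fit_helper <- lm_without(airquality, response = "Wind", drop = "Ozone")
fit_subset <- lm(Wind ~ ., data = subset(airquality, select = -Ozone))

# Both fits should use the same rows and give the same coefficients
print(nobs(fit_helper))
print(nobs(fit_subset))
print(all.equal(coef(fit_helper), coef(fit_subset)))
```

Both routes keep Ozone out of the NA handling; the `subset()` version is shorter for a one-off fit, while the helper avoids copying the data frame by hand when different sets of predictors are dropped repeatedly.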