How do I recode missing data so that my variables have the same length in R?

Asked: 2014-01-17 01:19:21

Tags: r linear-regression missing-data variable-length

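I have two variables: SAT scores in Verbal (SATV) and Quantitative (SATQ). There are 500 rows, and SATQ has 7 NA's. My goal is to run lm() and gvlma() with SATV and SATQ as the IVs.
But I get an error: R won't run my code because I omitted the NAs from SATQ, and now my variables have different lengths. How can I recode the NA's so that my variables stay the same length? Please ignore the non-normal data and the assumption violations. (I also have no idea what I'm doing in R, so if you can offer advice, pretend you are talking to someone who knows nothing about R.)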

> summary(SATQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  200.0   525.0   610.0   604.5   700.0   800.0       7 

> SATQ2<-na.omit(SATQ)

> summary(SATQ2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  200.0   525.0   610.0   604.5   700.0   800.0 

> summary(SATV)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  200.0   537.5   600.0   604.4   690.0   800.0

> summary(ms)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  765.6  1844.0  2133.0  2093.0  2395.0  2877.0 

> #ms= monthly salary
> m1 = lm(ms~SATV+SATQ2)
Error in model.frame.default(formula = ms ~ SATV + SATQ2, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'SATQ2')

> summary(m1)

Call:
lm(formula = ms ~ SATV + SATQ2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1551.58   -12.48    45.32    99.77   168.46 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  57.5656    55.1658   1.044    0.297    
SATV          1.4313     0.1030  13.890   <2e-16 ***
SATQ2         1.9350     0.1025  18.871   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 206.3 on 490 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.7419,    Adjusted R-squared:  0.7409 
F-statistic: 704.3 on 2 and 490 DF,  p-value: < 2.2e-16

> gvlma(m1)

Call:
lm(formula = ms ~ SATV + SATQ2)

Coefficients:
(Intercept)         SATV        SATQ2  
     57.566        1.431        1.935  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = m1) 

                       Value  p-value                   Decision
Global Stat        7.904e+03 0.00e+00 Assumptions NOT satisfied!
Skewness           1.261e+03 0.00e+00 Assumptions NOT satisfied!
Kurtosis           6.593e+03 0.00e+00 Assumptions NOT satisfied!
Link Function      2.317e-02 8.79e-01    Assumptions acceptable.
Heteroscedasticity 5.036e+01 1.28e-12 Assumptions NOT satisfied!

1 answer:

Answer 0 (score: 1)

Probably the simplest option is to do the following:

dta = data.frame(SATV=SATV, SATQ=SATQ, ms = ms)
lm(ms ~ SATV + SATQ, data = na.omit(dta))

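This removes the NAs row by row: any row with a missing value in any column is dropped as a whole, so all the variables passed to lm() keep the same length.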
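To make the difference concrete, here is a minimal sketch on simulated data. The numbers are made up and only mimic the shape described in the question (500 rows, 7 NAs in SATQ); the names SATQ_full, dta_complete and m2 are introduced purely for illustration. It shows why calling na.omit() on a single vector produces the length mismatch, and that dropping incomplete rows from a data frame avoids it.

```r
# Simulated stand-in data (assumption: not the asker's real data).
set.seed(1)
n <- 500
SATV <- round(rnorm(n, mean = 600, sd = 100))
SATQ_full <- round(rnorm(n, mean = 600, sd = 100))
ms <- 50 + 1.4 * SATV + 1.9 * SATQ_full + rnorm(n, sd = 200)

# Introduce 7 missing values in SATQ only, as in the question.
SATQ <- SATQ_full
SATQ[sample(n, 7)] <- NA

# na.omit() on one vector shortens only that vector, which is what
# leads to "variable lengths differ" when it is mixed with the others.
length(na.omit(SATQ))  # 493
length(SATV)           # 500

# Keep the variables together and drop incomplete rows as a unit.
dta <- data.frame(SATV = SATV, SATQ = SATQ, ms = ms)
dta_complete <- na.omit(dta)
nrow(dta_complete)     # 493

m1 <- lm(ms ~ SATV + SATQ, data = dta_complete)

# lm() can also drop the incomplete rows itself via its na.action argument.
m2 <- lm(ms ~ SATV + SATQ, data = dta, na.action = na.omit)
all.equal(coef(m1), coef(m2))  # TRUE
```

Either model object can then be passed on to summary() or, with the gvlma package loaded, to gvlma(). Keeping everything in one data frame is usually the safer habit, because the rows stay aligned no matter which column has the missing values.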