Removing a character-level outlier in R

Time: 2014-03-24 21:54:11

Tags: r lm outliers

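I have a linear model, model1 <- lm(divorce_rate~marriage_rate+median_age+population), whose leverage plot shows an outlier at 28 (the id of the state variable "Nevada"). I would like to fit the model without Nevada in the dataset. I tried the following but got stuck.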

data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)

dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)

The last line of the code above gives me

Error in model.frame.default(formula = divrate ~ marrate + medage + pop,  : 
  variable lengths differ (found for 'medage')

1 answer:

Answer 0 (score: 1)

I suspect there is some minor glitch in your code, or that attach()ed copies are still lying around in your environment; this is why it is really best practice not to use attach(). The following code works for me:

library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")

I didn't find divrate or marrate in the data set; I'm going to speculate that you wanted the per capita rates:

## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)

In a clean session, this works fine for me:

dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)

Or you can use the subset argument of lm() (or update()) instead of building a separate data frame.
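
A minimal sketch of the subset approach, so it can be tried without downloading census.dta. It substitutes R's built-in state.x77 data, which also has a Nevada row. The stand-in variables Murder, Illiteracy and Income, and the object names model_all, model_a and model_b, are assumptions for illustration only and are not the census variables from the question. The idea is that passing subset= through update() should give the same fit as subsetting the data frame first:

```r
## Stand-in data (an assumption): built-in state.x77 instead of census.dta
mydata <- data.frame(state = rownames(state.x77), state.x77,
                     stringsAsFactors = FALSE)

## Full model, including Nevada (hypothetical stand-in formula)
model_all <- lm(Murder ~ Illiteracy + Income, data = mydata)

## Option 1: subset the data frame first, then refit on the new data
dataNV <- subset(mydata, state != "Nevada")
model_a <- update(model_all, data = dataNV)

## Option 2: keep the data frame and pass a 'subset' argument instead
model_b <- update(model_all, subset = (state != "Nevada"))

## Compare the two fits: number of rows dropped and coefficient agreement
nobs(model_all) - nobs(model_b)
all.equal(coef(model_a), coef(model_b))
```

Because update() re-evaluates the original lm() call with the extra argument, the formula does not have to be repeated, and every variable is still looked up in the data frame given to data= rather than in attached copies.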