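I am trying to run a regression (lm) on groups (counties) of data in a data frame. However, I first want to filter that data frame (dat) to exclude groups that have too few data points. As long as I don't subset the data frame first, everything works: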
tmp1 <- with(dat,
             by(dat, County,
                function(x) lm(formula = Y ~ A + B + C, data = x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)
I get back the expected result:
Barrow Carroll Cherokee Clayton Cobb Dekalb Douglas
0.00000 NaN 0.61952 0.69591 0.48092 0.61292 0.39335
However, when I subset the data frame first:
dat.counties <- aggregate(dat[,"County"], by=list(County), FUN=length)
good.counties <- as.matrix(subset(dat.counties, x > 20, select=Group.1))
dat.temp <- dat["County" %in% good.counties,]
and then run the same code:
tmp2 <- with(dat,
             by(dat, County,
                function(x) lm(formula = Y ~ A + B + C, data = x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
I get the following error: "$ operator is invalid for atomic vectors". If I then run
summary(tmp2)
I see the following:
Length Class Mode
Barrow 0 -none- NULL
Carroll 0 -none- NULL
Cherokee 12 lm list
Clayton 12 lm list
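sapply is evidently choking on the objects with Class -none-. But those are the ones I excluded above! How are they still showing up in my new data frame?!
Thanks for any enlightenment.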
Answer 0 (score: 1)
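Some parts of the code are unclear. You may have attached the dataset with attach. Also, as @BrodieG commented, using the wrong data frame, dat instead of dat.temp, is a problem too. As for the error, it is probably because the County column is a factor and its unused levels were not dropped after subsetting, so by still creates an (empty) group for each excluded county. You can try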
dat.temp1 <- droplevels(dat.temp)
tmp2 <- with(dat.temp1,
             by(dat.temp1, County,
                function(x) lm(formula = Y ~ A + B + C, data = x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
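Why dropping levels matters can be seen in a small sketch. Subsetting a factor removes the observations but keeps the original set of levels, and by builds one group per level, empty or not. The names f and f_sub below are made up for illustration and are not part of the original data.

```r
# Toy factor; f and f_sub are hypothetical names used only for this sketch
f <- factor(c("a", "a", "b", "c", "c"))

# Remove every "b" observation
f_sub <- f[f != "b"]

levels(f_sub)               # "b" is still listed as a level
table(f_sub)                # "b" is reported with a count of 0
levels(droplevels(f_sub))   # only the levels that still occur remain
```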
Here is an example that reproduces the error:
set.seed(24)
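d <- data.frame(
  state = rep(c('NY', 'CA','MD', 'ND'), c(10,10,6,7)),
  year = sample(1:10,33,replace=TRUE),
  response= rnorm(33),
  stringsAsFactors = TRUE  # needed in R >= 4.0, where strings are no longer converted to factors by default
)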
tmp1 <- with(d, by(d, state, function(x) lm(formula=response~year, data=x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)
# CA MD ND NY
# 0.03701114 -0.04988296 -0.07817515 -0.11850038
d.states <- aggregate(d[,"state"], by=list(d[,'state']), FUN=length)
good.states <- as.matrix(subset(d.states, x > 6, select=Group.1))
d.sub <- d[d$state %in% good.states[,1],]
tmp2 <- with(d.sub,
             by(d.sub, state,
                function(x) lm(formula = response ~ year, data = x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#Error in summary(x)$adj.r.squared :
# $ operator is invalid for atomic vectors
If you look at the second element, the excluded state is still there, as an empty group:
tmp2[2]
#$MD
#NULL
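The NULL entry is what sapply trips over: summary(NULL) returns an atomic object, so $ cannot be applied to it. The sketch below illustrates this on a plain list standing in for the by result. The list res and the use of the built-in cars dataset are assumptions made for the illustration, and filtering out NULL entries is shown only as a possible workaround, not as the fix proposed in this answer.

```r
# Hypothetical stand-in for the 'by' result: one fitted model and one empty group
res <- list(CA = lm(dist ~ speed, data = cars), MD = NULL)

# summary(NULL) is an atomic (character) object, not a list
is.atomic(summary(res$MD))

# Applying $ to it raises the error; capture it instead of stopping
out <- tryCatch(
  sapply(res, function(x) summary(x)$adj.r.squared),
  error = function(e) e
)
inherits(out, "error")

# Possible workaround: skip the empty groups before summarising
ok <- Filter(Negate(is.null), res)
sapply(ok, function(x) summary(x)$adj.r.squared)
```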
# Dropping the unused factor levels removes the empty MD group
d.sub1 <- droplevels(d.sub)
tmp2 <- with(d.sub1,
             by(d.sub1, state,
                function(x) lm(formula = response ~ year, data = x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
# CA ND NY
# 0.03701114 -0.07817515 -0.11850038
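For comparison, the whole filter-then-fit workflow could also be written with table, split and lapply. This is only a sketch of one alternative, not the method used above. The names counts, keep, fits and adj.r2 are made up for illustration, and the threshold of 6 is copied from the example.

```r
set.seed(24)
d <- data.frame(
  state = factor(rep(c("NY", "CA", "MD", "ND"), c(10, 10, 6, 7))),
  year = sample(1:10, 33, replace = TRUE),
  response = rnorm(33)
)

# Count rows per state and keep the states with more than 6 rows
counts <- table(d$state)
keep <- names(counts)[counts > 6]

# Subset, then drop the unused levels so no empty groups are created
d.sub <- droplevels(d[d$state %in% keep, ])

# Fit one model per remaining state and collect the adjusted R-squared values
fits <- lapply(split(d.sub, d.sub$state),
               function(x) lm(response ~ year, data = x))
adj.r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
adj.r2
```

Referring to the columns as d$state and d.sub$state, rather than relying on attach or with, also makes it harder to pick up the wrong data frame by accident.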