Question

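I am trying to run a regression (lm) on groups of data (counties) in a data frame. However, I first want to filter that data frame (dat) to exclude some groups that have too few data points. As long as I don't subset the data frame first, everything works fine: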

tmp1 <- with(dat, 
    by(dat, County,
        function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)

I get back what I expect:

  Barrow  Carroll Cherokee  Clayton     Cobb   Dekalb  Douglas
 0.00000      NaN  0.61952  0.69591  0.48092  0.61292  0.39335

However, when I first subset the data frame:

dat.counties <- aggregate(dat[,"County"], by=list(County), FUN=length)
good.counties <- as.matrix(subset(dat.counties, x > 20, select=Group.1))
dat.temp <- dat["County" %in% good.counties,]

and then run the same code:

tmp2 <- with(dat, 
by(dat, County,
    function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)

I get the following error: "$ operator is invalid for atomic vectors". If I then run summary(tmp2), I see the following:

         Length Class  Mode
Barrow    0     -none- NULL
Carroll   0     -none- NULL
Cherokee 12     lm     list
Clayton  12     lm     list

sapply is clearly choking on the objects of class -none-. But those are the ones I excluded above! How are they still showing up in my new data frame?!

Thanks for any enlightenment.

Answer 1

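Some parts of the code are unclear. You may have attach()ed the dataset. Also, as @BrodieG commented, the second call still uses dat instead of dat.temp. Note too that "County" %in% good.counties tests the literal string "County" rather than the column, so the subset should be written as dat[dat$County %in% good.counties, ]. As for the error, it is probably because the County column is a factor and its unused levels were not dropped after subsetting. You can try: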

dat.temp1 <- droplevels(dat.temp)
tmp2 <- with(dat.temp1, 
      by(dat.temp1, County,
      function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
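
To illustrate why the excluded groups can still show up, here is a minimal sketch on a made-up data frame (toy, counts and keep are hypothetical names, not from the question). It assumes County is a factor, and it subsets with toy$County %in% keep rather than the bare string "County".

```r
# Hypothetical toy data; County is deliberately created as a factor
toy <- data.frame(
  County = factor(c("A", "A", "A", "B", "C", "C")),
  Y = c(1.2, 2.3, 3.1, 0.5, 4.4, 5.0)
)

# Keep only counties with more than one row
counts <- table(toy$County)
keep <- names(counts)[counts > 1]

"County" %in% keep                  # FALSE: tests the literal string, not the column
toy.sub <- toy[toy$County %in% keep, ]

levels(toy.sub$County)              # "A" "B" "C": "B" survives as an empty level
table(toy.sub$County)               # B is listed with a count of 0

# droplevels() removes factor levels that no longer occur in the data
toy.sub2 <- droplevels(toy.sub)
levels(toy.sub2$County)             # "A" "C"
```

Grouping functions such as by() iterate over the factor levels, not over the values that are actually present, which is why an empty level still produces an (empty) group.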

Here is an example that reproduces the error:

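set.seed(24)
# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer
# converted to factors by default; the error depends on 'state' being a factor
d <- data.frame(
 state = rep(c('NY', 'CA','MD', 'ND'), c(10,10,6,7)),
 year = sample(1:10,33,replace=TRUE),
 response= rnorm(33),
 stringsAsFactors = TRUE
)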

 tmp1 <- with(d, by(d, state, function(x) lm(formula=response~year, data=x)))
 sapply(tmp1, function(x) summary(x)$adj.r.squared)
 #       CA          MD          ND          NY 
 # 0.03701114 -0.04988296 -0.07817515 -0.11850038 

d.states <- aggregate(d[,"state"], by=list(d[,'state']), FUN=length)
good.states <- as.matrix(subset(d.states, x > 6, select=Group.1))
d.sub <-  d[d$state %in% good.states[,1],]

tmp2 <- with(d.sub, 
    by(d.sub, state,
      function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#Error in summary(x)$adj.r.squared : 
# $ operator is invalid for atomic vectors
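
The message itself can be traced to what summary() returns for a NULL element. The following sketch is independent of the data above and only assumes that one of the list elements passed to sapply is NULL; s and res are illustrative names.

```r
# summary() of NULL falls through to summary.default(), which returns a small
# named character vector (Length, Class, Mode) rather than a list
s <- summary(NULL)
is.atomic(s)   # TRUE

# Using $ on an atomic vector is an error; capture the message to inspect it
res <- tryCatch(s$adj.r.squared, error = function(e) conditionMessage(e))
res
```

So a single NULL entry in the by() result is enough to make the whole sapply call fail.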

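If you look at the second element, it is NULL, because MD is still a level of state even though no rows remain for it:

 tmp2[2]
 #$MD
 #NULL

Dropping the unused levels with droplevels() before calling by() removes the empty group: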

d.sub1 <- droplevels(d.sub)
tmp2 <- with(d.sub1, 
      by(d.sub1, state,
          function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#       CA          ND          NY 
# 0.03701114 -0.07817515 -0.11850038
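
As an alternative to by() plus droplevels(), the same filtering and fitting could be sketched with table(), split() and lapply(). This is only one possible approach, not the method used above; n.per.state, fits and adj.r2 are names introduced here for illustration, and state is created explicitly as a factor.

```r
set.seed(24)
d <- data.frame(
  state = factor(rep(c("NY", "CA", "MD", "ND"), c(10, 10, 6, 7))),
  year = sample(1:10, 33, replace = TRUE),
  response = rnorm(33)
)

# Count rows per state and keep the states with more than 6 observations
n.per.state <- table(d$state)
good.states <- names(n.per.state)[n.per.state > 6]
d.sub <- d[d$state %in% good.states, ]

# split() with drop = TRUE skips factor levels that have no rows,
# so no NULL elements reach sapply()
fits <- lapply(split(d.sub, d.sub$state, drop = TRUE),
               function(x) lm(response ~ year, data = x))
adj.r2 <- sapply(fits, function(x) summary(x)$adj.r.squared)
names(adj.r2)   # "CA" "ND" "NY"
```

Because the empty MD level is dropped at the split step, the subset does not need a separate droplevels() call.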

Error when subsetting a data frame in R and then using sapply
