Looping regressions over an unbalanced data set in R (using apply functions)

Asked: 2014-05-02 12:21:47

Tags: r for-loop apply linear-regression lapply

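I have a data set covering 100 different countries, with 5 variables per country. For each country I want to run a linear regression and store the results afterwards. The main problem is that for some countries I have no data for some of the variables.

My data set has the following structure: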

set.seed(1)
Q <- as.data.frame(matrix(rnorm(360),9,40))
colnames(Q)[1]<- "Country"
colnames(Q)[2]<- "Variable"
colnames(Q)[3:40] <- paste(1900:1937)
Q[1:3,1] <- "CountryA"
Q[4:6,1] <- "CountryB"
Q[7:9,1] <- "CountryC"
Q[1:3,2] <- paste("var",1:3,sep="")
Q[4:6,2] <- paste("var",1:3,sep="")
Q[7:9,2] <- paste("var",1:3,sep="")

For each country I want to run the regression:

lm(var1~var2+var3)

1. Example with a balanced data set

My approach is as follows:

# subset the data set for each country (if someone knows an easier approach, please tell me)
datasets <- list(NA)
j <- 1
for(cat in unique(Q$Country)){
  sub <- subset(Q, Country==cat, select=c(2:40))
  sub1 <- as.data.frame(t(sub))
  colnames(sub1) <- sub[,1 ]
  sub1 <- sub1[-1, ]
  sub1$var1 <- as.numeric(as.character(sub1$var1)) 
  sub1$var2 <- as.numeric(as.character(sub1$var2))
  sub1$var3 <- as.numeric(as.character(sub1$var3))
  sub1 <- sub1[,colSums(is.na(sub1))<nrow(sub1)]
  datasets[[j]] <- sub1
  j <- j+1

}

# apply linear regression to each dataset (llply and ldply come from the plyr package)
library(plyr)
regressions <- llply(datasets, lm, formula = var1 ~ .)

# extract coefficients from regressions
coefs <- ldply(regressions, coef)

This works fine:

> coefs
   (Intercept)       var2       var3
1 0.0009635977  0.1627555 -0.1738419
2 0.2571188803 -0.3548750 -0.0248167
3 0.1109881052 -0.0722544  0.1439666
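
On the request for an easier way to subset: the per-country reshaping can probably be written without an explicit loop by combining split() with lapply(). The following is only a sketch in base R under that assumption; `to_wide` is a hypothetical helper name introduced here, and the data construction repeats the example above in condensed form.

```r
# Rebuild the example data from the question (condensed).
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360), 9, 40))
colnames(Q) <- c("Country", "Variable", 1900:1937)
Q$Country  <- rep(c("CountryA", "CountryB", "CountryC"), each = 3)
Q$Variable <- rep(paste0("var", 1:3), times = 3)

# Hypothetical helper: turn one country's rows (one row per variable)
# into a data frame with one row per year and one column per variable.
to_wide <- function(x) {
  setNames(as.data.frame(t(x[, -(1:2)])), x$Variable)
}

# One data frame per country, then one regression per data frame.
datasets    <- lapply(split(Q, Q$Country), to_wide)
regressions <- lapply(datasets, function(d) lm(var1 ~ ., data = d))

# Stack the coefficient vectors into a matrix, one row per country.
coefs <- do.call(rbind, lapply(regressions, coef))
```

Because the year columns are already numeric, no as.numeric(as.character(...)) conversion is needed here, and the rows of `coefs` are labelled by country name rather than by position.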

2. Example with an unbalanced data set

Now I introduce missing variables into the data set:

# Add missing variables: 
Q[2,3:40] <- rep(NA) 
Q[6,3:40] <- rep(NA)

If I rerun the loop from step 1, I get an error message (the code runs fine up to the last statement, coefs <- ldply(regressions, coef), which fails):

[...]
> coefs <- ldply(regressions, coef)
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths
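
The message points at the likely cause: once an all-NA column has been dropped for a country, that country's model has fewer terms, so coef() returns vectors of different lengths that cannot be stacked into one data frame. A minimal sketch of the mismatch, using made-up toy data frames (`full` and `partial` are illustrative names, not part of the code above) and base R only:

```r
set.seed(1)

# A country with all three variables, and one where var2 was dropped
# because it was entirely NA.
full    <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10))
partial <- full[, c("var1", "var3")]

# Same call pattern as in the loop: each data frame is passed to lm() as `data`.
fits <- lapply(list(full, partial), lm, formula = var1 ~ .)

# Number of coefficients per model: the two models disagree.
lens <- sapply(fits, function(f) length(coef(f)))
lens
```

Here `lens` holds 3 for the complete data frame and 2 for the one with the dropped column.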

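My question: how can I modify the code so that it also works for unbalanced data sets (where some variables are missing)?

Thanks for any help or suggestions!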

1 Answer:

Answer 0 (score: 2):

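Replace the columns that are entirely NA with zeros. An all-zero column carries no information, so lm() reports its coefficient as NA instead of dropping the term, and every country yields a coefficient vector of the same length: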

Coef <- function(x) {
    DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
    DF[colSums(is.na(DF)) == nrow(DF)] <- 0
    coef(lm(var1 ~., DF))
}
do.call(rbind, by(Q, Q$Country, Coef))

which gives:

         (Intercept)        var2       var3
CountryA  0.01863015          NA -0.1982462
CountryB  0.26296826 -0.35416216         NA
CountryC  0.11098809 -0.07225439  0.1439667
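
A different route, not the one used in this answer, is to keep the models as they are and pad each coefficient vector to a common set of term names before binding. The sketch below assumes the full term list is known in advance; `pad_coefs`, `all_terms`, and the toy data frames are hypothetical names introduced only for illustration.

```r
set.seed(1)

# Toy stand-ins for two countries: one is missing var2 altogether.
full    <- data.frame(var1 = rnorm(10), var2 = rnorm(10), var3 = rnorm(10))
partial <- full[, c("var1", "var3")]

fits <- list(
  CountryA = lm(var1 ~ ., data = partial),
  CountryC = lm(var1 ~ ., data = full)
)

# Assumed: the complete set of terms every row should report.
all_terms <- c("(Intercept)", "var2", "var3")

# Hypothetical helper: start from an all-NA named vector and fill in
# whichever coefficients the model actually estimated.
pad_coefs <- function(fit, terms) {
  out <- setNames(rep(NA_real_, length(terms)), terms)
  cf  <- coef(fit)
  out[names(cf)] <- cf
  out
}

coefs <- do.call(rbind, lapply(fits, pad_coefs, terms = all_terms))
coefs
```

The result has the same shape as the table above: one row per country, with NA wherever a variable was unavailable. Padding avoids inserting artificial zero columns into the data, at the cost of having to spell out the term names.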