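I have data sets for 100 different countries, with 5 variables per country. For each country I want to run a linear regression and store the results afterwards. The main problem is that for some countries I have no data for some of the variables.
My data set has the following structure: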
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360),9,40))
colnames(Q)[1]<- "Country"
colnames(Q)[2]<- "Variable"
colnames(Q)[3:40] <- paste(1900:1937)
Q[1:3,1] <- "CountryA"
Q[4:6,1] <- "CountryB"
Q[7:9,1] <- "CountryC"
Q[1:3,2] <- paste("var",1:3,sep="")
Q[4:6,2] <- paste("var",1:3,sep="")
Q[7:9,2] <- paste("var",1:3,sep="")
For each country, I want to run the regression:
lm(var1~var2+var3)
1. Example with a balanced data set
My approach is as follows:
# subset the data set for each country (if someone knows an easier approach, please tell me)
datasets <- list(NA)
j <- 1
for(cat in unique(Q$Country)){
sub <- subset(Q, Country==cat, select=c(2:40))
sub1 <- as.data.frame(t(sub))
colnames(sub1) <- sub[,1 ]
sub1 <- sub1[-1, ]
sub1$var1 <- as.numeric(as.character(sub1$var1))
sub1$var2 <- as.numeric(as.character(sub1$var2))
sub1$var3 <- as.numeric(as.character(sub1$var3))
sub1 <- sub1[,colSums(is.na(sub1))<nrow(sub1)]
datasets[[j]] <- sub1
j <- j+1
}
# apply linear regression to each dataset (llply and ldply come from the plyr package)
library(plyr)
regressions <- llply(datasets, lm, formula = var1 ~.)
# extract coefficients from regressions
coefs <- ldply(regressions, coef)
This works fine:
> coefs
(Intercept) var2 var3
1 0.0009635977 0.1627555 -0.1738419
2 0.2571188803 -0.3548750 -0.0248167
3 0.1109881052 -0.0722544 0.1439666
2. Example with an unbalanced data set
Now I make some variables missing in the data set:
# Add missing variables:
Q[2,3:40] <- NA   # CountryA has no data for var2
Q[6,3:40] <- NA   # CountryB has no data for var3
If I run the loop from step 1 again, I get an error message (the code itself runs, but the last statement, coefs <- ldply(regressions, coef), fails):
[...]
> coefs <- ldply(regressions, coef)
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
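My question: how can I modify the code so that it also works for unbalanced data sets (where some variables are missing)?
Thanks for any help or suggestions!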
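Answer 0 (score: 2)
Replace the columns that are entirely NA with zeros. An all-zero column is collinear with the intercept, so lm reports its coefficient as NA instead of dropping the term, and every country returns a coefficient vector of the same length: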
Coef <- function(x) {
DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
DF[colSums(is.na(DF)) == nrow(DF)] <- 0
coef(lm(var1 ~., DF))
}
do.call(rbind, by(Q, Q$Country, Coef))
which gives:
(Intercept) var2 var3
CountryA 0.01863015 NA -0.1982462
CountryB 0.26296826 -0.35416216 NA
CountryC 0.11098809 -0.07225439 0.1439667
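A possible variation, as a rough sketch only: instead of filling the empty columns with zeros, drop them (as the question's loop does) and then align each coefficient vector to a fixed set of names, so missing terms come back as NA. The helper name fit_country and its all_terms argument are made up for this illustration, and the data are rebuilt as in the question so the block runs on its own.

```r
# Rebuild the unbalanced example data from the question.
set.seed(1)
Q <- as.data.frame(matrix(rnorm(360), 9, 40))
colnames(Q) <- c("Country", "Variable", 1900:1937)
Q$Country  <- rep(c("CountryA", "CountryB", "CountryC"), each = 3)
Q$Variable <- rep(paste0("var", 1:3), times = 3)
Q[2, 3:40] <- NA   # CountryA has no var2
Q[6, 3:40] <- NA   # CountryB has no var3

# Hypothetical helper (not from the original answer): fit var1 on whichever
# predictors have data and return the coefficients in a fixed, named layout.
fit_country <- function(x, all_terms = c("(Intercept)", "var2", "var3")) {
  # Years become rows, variables become columns.
  DF <- setNames(as.data.frame(t(x[-(1:2)])), x$Variable)
  # Drop predictors that are entirely NA for this country.
  DF <- DF[, colSums(is.na(DF)) < nrow(DF), drop = FALSE]
  est <- coef(lm(var1 ~ ., data = DF))
  # Indexing by name yields NA for terms that were not estimated.
  setNames(est[all_terms], all_terms)
}

# One row per country, one column per term.
coefs <- do.call(rbind, by(Q, Q$Country, fit_country))
print(coefs)
```

The resulting matrix should have the same layout as the output above, with NA where a variable is missing. Since a dropped column and an aliased all-zero column lead to the same fit, the remaining estimates are expected to agree as well. This version assumes the full set of terms is known in advance and that var1 itself is never entirely missing.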