I have imported a large table from an SQL database; its structure is similar to this example table:
testData <- data.frame(
BatchNo = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3),
Y = c(1,1.247011378,1.340630851,1.319026357,1.41264583,1.093619473,1.38023909,1.473858563,1,1.093619473,1.038888089,1.081833061,1,1.215913383,1.278861891,1.297746443,1.360694952,1.332368123,1.414201183,1,1.081833061,1,1.063661202),
Categorical1 = c("A9","B5513","B5513","B5514","B5514","A9","B5514","B5514","A9","A9","B1723","A9","A9","B5513","B5514","B5513","B5514","B5514","B5514","A9","A9","A486","B1701"),
Categorical2 = c("A2793","B5512","B5512","B5512","B5512","B5508","B6623","B6623","B5508","B5508","B5508","A127","A127","B5515","B5515","B5515","B5515","B6623","B6623","A127","A127","A2727","A2727"),
Categorical3 = c("A5510","B5511","B5511","B5511","B5511","A5510","B5511","B5511","B5511","B5511","B5511","A5518","A5518","B5517","B5517","B5517","B5517","B5517","B5517","B5517","B5517","A2","A2"),
Categorical4 = c("A5","A5","B649","A5","B649","B649","A5","B649","A5","B649","A5","B649","A5","A5","A5","B649","B649","A5","B649","A5","B649","A649","A649"),
Binary1 = c(rep(0,times=23)),
Binary2 = c(rep(0,times=23)),
Binary3 = c(rep(0,times=23)),
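Binary4 = c(rep(0,times=23)),
stringsAsFactors = TRUE  # R >= 4.0 keeps strings as character by default; droplevels() and levels() below need factors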
)
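What I want to do in a for loop is:
1. Create subset data frames based on the BatchNo column (1 to 2500)
2. Fit a linear model to each subset data frame
3. Export the list of coefficient estimates back to an SQL table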
n <- max(testData[,1])
for (i in 1:n) {
  # subset one batch and drop factor levels that no longer occur in it
  dat <- droplevels(subset(testData, BatchNo == i, select = 1:10))
  # store each fitted model as lm.1, lm.2, ...
  assign(paste("lm.", i, sep = ""),
         lm(Y~Categorical1+Categorical2+Categorical3+Categorical4+Binary1+Binary2+Binary3+Binary4, data = dat))
}
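The problem is that some subsets will have at least one of the 4 categorical variables (or possibly all of them) with only a single level (as with BatchNo = 3 in this example), and R cannot use such variables in the regression.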
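The failure can be reproduced in isolation. The following is a minimal sketch on a made-up three-row subset (the values are hypothetical, not taken from the real table); `lm()` is expected to stop with an error about contrasts when a factor has only one level.

```r
# Hypothetical subset in which the only predictor is a single-level factor
dat <- data.frame(
  Y = c(1, 1.06, 1.02),
  Categorical2 = factor(c("A2727", "A2727", "A2727"))
)

# Capture the error message instead of aborting the script
msg <- tryCatch(
  {
    lm(Y ~ Categorical2, data = dat)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Wrapping the call in `tryCatch()` is only for demonstration; inside the real loop the same error would abort the whole run at the first affected batch.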
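For the binary predictors this is not a problem, because a constant column only produces NA coefficient estimates, and I run step(backward) after fitting the model to remove any of those.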
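At first I tried using step(forward) to select only the meaningful predictors in each loop, but that did not work because I have to list all potential predictors for it to choose from.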
I can think of two possible solutions:
... the lm formula. So far I have just created these two vectors:
factors<-dat[,3:6]
f<-names(factors)
levels<-c(length(levels(factors[,1])),length(levels(factors[,2])),length(levels(factors[,3])),length(levels(factors[,4])))
So now I just need to remove the n-th element from "f" wherever the n-th element of "levels" equals 1.
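One way to finish this idea, as a rough sketch rather than a definitive solution: index "f" with a logical vector and rebuild the formula from the surviving names with `reformulate()`. The toy `dat` below is hypothetical (it mimics a batch like BatchNo = 3), and the names `n_levels`, `keep` and `form` are introduced here for illustration only.

```r
# Hypothetical single-batch subset: only Categorical1 has more than one level
dat <- data.frame(
  BatchNo = c(3, 3, 3, 3),
  Y = c(1, 1.06, 1.02, 1.10),
  Categorical1 = factor(c("A486", "B1701", "A486", "B1701")),
  Categorical2 = factor(rep("A2727", 4)),
  Categorical3 = factor(rep("A2", 4)),
  Categorical4 = factor(rep("A649", 4))
)

factors <- dat[, 3:6]
f <- names(factors)
n_levels <- sapply(factors, nlevels)  # number of levels per factor column

# Keep only the factor names that have at least two levels
keep <- f[n_levels > 1]

# reformulate() needs at least one term, so fall back to an intercept-only model
form <- if (length(keep) > 0) reformulate(keep, response = "Y") else Y ~ 1

fit <- lm(form, data = dat)
coef(fit)
```

The binary predictors could be appended to `keep` before calling `reformulate()` if they should stay in the model.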
Answer 0: (score: 1)
In the end I found a way to do what I wanted. There may be a simpler or more elegant way, but this is what I used:
l<-nrow(dat)
a<-length(levels(dat[,3]))
b<-length(levels(dat[,4]))
c<-length(levels(dat[,5]))
d<-length(levels(dat[,6]))
zeros<-c(rep(0,times=l))
# Replace each single-level factor (columns 3 to 6) with a column of zeros
if (a<2) dat[,3]<-zeros
if (b<2) dat[,4]<-zeros
if (c<2) dat[,5]<-zeros
if (d<2) dat[,6]<-zeros
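In each loop iteration, every single-level factor is replaced by a vector of zeros of the appropriate length, so the regression runs without error.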
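The same replacement can be written as a loop over column names instead of four separate checks. This is only a sketch on a hypothetical two-factor subset; `cat_cols` is an illustrative name. The zeroed column stays in the formula and gets an NA coefficient, just like the constant binary predictors.

```r
# Hypothetical subset: Categorical2 has a single level
dat <- data.frame(
  Y = c(1, 1.06, 1.02, 1.10),
  Categorical1 = factor(c("A486", "B1701", "A486", "B1701")),
  Categorical2 = factor(rep("A2727", 4))
)

cat_cols <- c("Categorical1", "Categorical2")
for (col in cat_cols) {
  # A factor with fewer than two levels is swapped for a numeric zero column
  if (nlevels(dat[[col]]) < 2) dat[[col]] <- rep(0, nrow(dat))
}

fit <- lm(Y ~ Categorical1 + Categorical2, data = dat)
coef(fit)  # the zeroed column is aliased, so its estimate is NA
```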
Answer 1: (score: 1)
Try this:
do.call(rbind,
  lapply(split(testData, testData$BatchNo), function(i){
    #check if factor columns have more than 1 level
    cats <- colnames(i)[c(3:6)[sapply(i[, c(3:6)], function(j){length(unique(j))}) > 1]]
    cats <- paste(cats, collapse = "+")
    fit <- lm(as.formula(paste0("Y~", cats, "+Binary2+Binary3+Binary4")), data = i)
    #return coef as df
    as.data.frame(coef(fit))
  })
)
Output
# coef(fit)
# 1.(Intercept) 1.000000e+00
# 1.Categorical1B1723 3.888809e-02
# 1.Categorical1B5513 3.082241e-01
# 1.Categorical1B5514 3.802391e-01
# 1.Categorical2B5508 5.611389e-16
# 1.Categorical2B5512 -6.121273e-02
# 1.Categorical2B6623 NA
# 1.Categorical3B5511 1.699675e-17
# 1.Categorical4B649 9.361947e-02
# 1.Binary2 NA
# 1.Binary3 NA
# 1.Binary4 NA
# 2.(Intercept) 1.000000e+00
# 2.Categorical1B5513 2.694196e-01
# 2.Categorical1B5514 3.323681e-01
# 2.Categorical2B5515 -5.350623e-02
# 2.Categorical2B6623 NA
# 2.Categorical3B5517 3.289161e-18
# 2.Categorical4B649 8.183306e-02
# 2.Binary2 NA
# 2.Binary3 NA
# 2.Binary4 NA
# 3.(Intercept) 1.000000e+00
# 3.Categorical1B1701 6.366120e-02
# 3.Binary2 NA
# 3.Binary3 NA
# 3.Binary4 NA
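Since the goal is to export the estimates back to an SQL table, it may be more convenient to return the batch number and the term name as proper columns instead of encoding them in the row names. The following is a sketch under stated assumptions: the small `testData` here is a hypothetical stand-in with only two categorical columns, and `coef_list` and `coef_table` are illustrative names.

```r
# Small hypothetical sample with the same kind of structure as testData
testData <- data.frame(
  BatchNo = c(1, 1, 1, 1, 2, 2, 2, 2),
  Y = c(1, 1.25, 1.34, 1.09, 1.08, 1, 1.22, 1.28),
  Categorical1 = c("A9", "B5513", "B5513", "A9", "A9", "A9", "B5513", "B5514"),
  Categorical2 = c("A2793", "B5512", "B5512", "B5508", "A127", "A127", "A127", "A127")
)

coef_list <- lapply(split(testData, testData$BatchNo), function(d) {
  cols <- c("Categorical1", "Categorical2")
  # Keep only predictors that take more than one distinct value in this batch
  keep <- cols[sapply(d[cols], function(x) length(unique(x)) > 1)]
  form <- if (length(keep) > 0) reformulate(keep, response = "Y") else Y ~ 1
  est <- coef(lm(form, data = d))
  # One row per coefficient, with batch and term as ordinary columns
  data.frame(BatchNo = d$BatchNo[1], Term = names(est), Estimate = unname(est),
             stringsAsFactors = FALSE)
})

coef_table <- do.call(rbind, coef_list)
rownames(coef_table) <- NULL
coef_table
```

A long table like `coef_table` could then be written to the database in one call, for example with `DBI::dbWriteTable()`, assuming a DBI connection to the SQL server is available.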