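I am trying to write a parallelized for loop that finds the best GLM for predicting whether I will play tennis (yes/no, coded as binary), adding only the variable with the lowest p-value at each step.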
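For example, I have a table (held in a data frame) containing a meteorological data set. I build the following GLM models and check which of them has the lowest p-value:
PlayTennis ~ Precip
PlayTennis ~ Temp
PlayTennis ~ Relative_Humidity
PlayTennis ~ WindSpeed
Say PlayTennis ~ Precip has the lowest p-value. The next iteration of the repeat loop then checks which of the remaining variables has the lowest p-value when added to it: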
PlayTennis ~ Precip + Temp
PlayTennis ~ Precip + Relative_Humidity
PlayTennis ~ Precip + WindSpeed
This continues until no significant variables remain (p-value greater than 0.05). So we end up with a final model of PlayTennis ~ Precip + WindSpeed (this is all hypothetical).
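Are there any suggestions on how to parallelize this code across multiple cores? I came across speedglm, a faster replacement for glm from the speedglm library. It does improve things, but not by much. I also looked at foreach loops, but I am not sure how to communicate with each thread to find out whether the p-value from one run is higher or lower than the others. Thanks in advance for any help.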
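On the foreach point: the iterations do not need to talk to each other. Each iteration can return its own p-value, and the comparison can happen afterwards in the parent session. Below is a minimal sketch, assuming the foreach and doParallel packages are installed. The data frame `toy` and its columns `x1`, `x2`, `x3`, `y` are simulated purely for illustration and are not the weather data.

```r
library(foreach)
library(doParallel)

# Hypothetical toy data, simulated only so the sketch runs on its own
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- rbinom(200, 1, plogis(1.5 * toy$x1))

cl <- makeCluster(2)      # two worker processes; adjust to the machine
registerDoParallel(cl)

candidates <- c("x1", "x2", "x3")

# Each iteration returns one p-value; .combine = c collects them into a vector
pvals <- foreach(v = candidates, .combine = c) %dopar% {
  fit <- glm(as.formula(paste("y ~", v)), data = toy, family = binomial())
  coef(summary(fit))[v, 4]
}

stopCluster(cl)

names(pvals) <- candidates
best <- candidates[which.min(pvals)]   # the comparison happens here, in the parent
```

Assignments made inside the `%dopar%` body (such as `MIN <- temp`) only change variables on the worker, which is why the value of the block has to be returned and combined instead.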
d =
Time            Precip  Temp  Relative_Humidity  WindSpeed  …  PlayTennis
1/1/2000 0:00   0       88    30                 0             1
1/1/2000 1:00   0       80    30                 1             1
1/1/2000 2:00   0       70    44                 0             1
1/1/2000 3:00   0       75    49                 10            0
1/1/2000 4:00   0.78    64    99                 15            0
1/1/2000 5:00   0.01    66    97                 15            0
1/1/2000 6:00   0       74    88                 8             0
1/1/2000 7:00   0       77    82                 1             1
1/1/2000 8:00   0       78    70                 1             1
1/1/2000 9:00   0       79    71                 1             1
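For anyone who wants to reproduce this, the ten rows shown above could be entered as a data frame roughly like this. It is only a sketch: the columns elided by "…" are left out, `Time` is kept as plain text, and ten rows are far too few for a meaningful fit.

```r
# The ten sample rows from the question; elided columns are omitted
d <- data.frame(
  Time              = paste0("1/1/2000 ", 0:9, ":00"),
  Precip            = c(0, 0, 0, 0, 0.78, 0.01, 0, 0, 0, 0),
  Temp              = c(88, 80, 70, 75, 64, 66, 74, 77, 78, 79),
  Relative_Humidity = c(30, 30, 44, 49, 99, 97, 88, 82, 70, 71),
  WindSpeed         = c(0, 1, 0, 10, 15, 15, 8, 1, 1, 1),
  PlayTennis        = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
)
```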
My code is below:
# Candidate predictors: every column except the response.
# Dropping "Time" as well is an assumption (it is a timestamp, not a predictor).
newNames <- setdiff(names(d), c("Time", "PlayTennis"))
FRM <- "PlayTennis ~ "
counter <- 2 # row of the newest term in the coefficient table (row 1 is the intercept)
repeat
{
  if (length(newNames) == 0) # nothing left to try
  {
    break
  }
  MIN <- Inf
  for (i in seq_along(newNames))
  {
    frm <- as.formula(paste(FRM, newNames[i], sep = ""))
    GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
               data = d, family = binomial())
    # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
    #                 data = d, family = binomial())
    temp <- coef(summary(GLM))[, 4][counter] # p-value of the newly added variable
    if (!is.na(temp) && temp < MIN) # track the min p value, location, and variable name
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
  }
  if (MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
  {
    break
  }
  FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
  newNames <- newNames[newNames != VAR] # removes variable that is the most significant
  counter <- counter + 1
}
The code I have tried, which did not work:
newNames <- names(d)
FRM <- "PlayTennis ~"
repeat
{
  foreach (i = 1:length(newNames)) %dopar%
  {
    frm <- as.formula(paste(FRM, newNames[i], sep =""))
    GLM <- glm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
               data = d, family = binomial())
    # GLM <- speedglm(formula = frm, na.action = na.exclude, # exclude NA values where they exist
    #                 data = d, family = binomial())
    temp <- coef(summary(GLM))[,4][counter]
    if (i == 1) # assign min p value, location, and variable name to the first iteration
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
    if (temp < MIN) # adjust the min p value accordingly
    {
      MIN <- temp
      LOC <- i
      VAR <- newNames[i]
    }
  }
  if(MIN > 0.05) # break out of the repeat loop when the p-value > 0.05
  {
    break
  }
  FRM <- paste(FRM, VAR, " + ", sep = "") # create new formula
  newNames <- newNames[which(newNames != VAR)] # removes variable that is the most significant
  counter <- counter + 1
}
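One way the whole selection loop might be restructured is sketched below. It swaps foreach for `parLapply` from the base `parallel` package, so this is an alternative technique and not a fix of the foreach code above. The steps themselves stay sequential (each depends on the previous winner), and only the candidate fits within a step run in parallel. The helper names `pvalue_when_added` and `forward_select`, and the simulated data frame `sim`, are hypothetical and exist only to make the sketch runnable.

```r
library(parallel)

# Hypothetical simulated weather-like data, used only so the sketch is runnable
set.seed(42)
n <- 500
sim <- data.frame(Precip = rexp(n), Temp = rnorm(n, 70, 10),
                  Relative_Humidity = runif(n, 20, 100),
                  WindSpeed = rexp(n, 0.2))
sim$PlayTennis <- rbinom(n, 1, plogis(1 - 1.5 * sim$Precip - 0.15 * sim$WindSpeed))

# p-value of `var` when it is added to the terms already selected
pvalue_when_added <- function(var, selected, response, data) {
  frm <- reformulate(c(selected, var), response = response)
  fit <- glm(frm, data = data, family = binomial(), na.action = na.exclude)
  tab <- coef(summary(fit))
  # Look the row up by name; an aliased term is missing from the table
  if (var %in% rownames(tab)) tab[var, 4] else NA_real_
}

forward_select <- function(data, response, candidates, cl, alpha = 0.05) {
  selected <- character(0)
  while (length(candidates) > 0) {
    # One model per remaining candidate, fitted on the workers;
    # the p-values come back to the parent as a list
    p <- unlist(parLapply(cl, candidates, pvalue_when_added,
                          selected = selected, response = response, data = data))
    if (all(is.na(p)) || min(p, na.rm = TRUE) > alpha) break
    best <- candidates[which.min(p)]      # comparison is done in the parent
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

cl <- makeCluster(2)   # two worker processes; adjust to the machine
chosen <- forward_select(sim, "PlayTennis",
                         c("Precip", "Temp", "Relative_Humidity", "WindSpeed"), cl)
stopCluster(cl)
```

Looking the p-value up by row name avoids the `counter` bookkeeping. With only a handful of candidates and a small data set, the cost of starting workers and sending them the data can outweigh the time saved, so any speedup is more likely with many candidate columns or many rows.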