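I'm facing a problem with Bayesian networks and hope to find an answer. I'll try to break the problem down to make it easier to follow.
Goal: after training a model, given evidence on a set of variables X (read: matrix A), I want to predict a set of variables Y (read: matrix B). For example:
Given evidence on variables A, B, C, I want to predict variables D, E, F.
In other words, I'm looking for multi-variable prediction.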
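To make the goal concrete, here is a minimal sketch on bnlearn's small built-in learning.test data set, whose variables happen to be named A to F. It is not my real setup: the evidence/target split, the sample of 200 rows and the use of predict() with method = "bayes-lw" are assumptions chosen for illustration. predict() returns a single most probable level per row, which differs from the sampling-based hit rate I compute in my own code.

```r
library(bnlearn)

# Small discrete data set shipped with bnlearn (variables A..F).
data(learning.test)
set.seed(7)

evid <- c("A", "B", "C")  # illustrative evidence nodes
evnt <- c("D", "E", "F")  # illustrative target nodes

# Learn a structure and fit its parameters on all rows.
fitted <- bn.fit(tabu(learning.test), learning.test)

# Illustrative test set: 200 rows drawn from the same data.
test <- learning.test[sample(nrow(learning.test), 200), ]

# Per-target accuracy: share of rows whose predicted level equals the observed one.
acc <- sapply(evnt, function(node) {
  pred <- predict(fitted, node = node, data = test,
                  method = "bayes-lw", from = evid)
  mean(as.character(pred) == as.character(test[[node]]))
})

acc        # accuracy per target node
mean(acc)  # mean accuracy across the targets
```

Each target is predicted separately from the same evidence, so this is a per-node prediction repeated over several nodes rather than a joint prediction of D, E and F.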
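Problem: I'm currently using the bnlearn package in R, and my worst models (an empty graph and a random graph) sometimes get better accuracy than my best model (learned from the data), or come very close to it. Worse, the test set is a subset of my training set, so technically my best model should score ~100%, and it doesn't.
I'd like to know whether there is a mathematical error in my implementation (the code is explained step by step), and how to interpret this result.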
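One check that may help interpret these numbers, offered only as a sketch: my accuracy is the share of sampled values that match the true value, not the share of rows where the single most probable value is right. For a node with no parents (as in the empty graph), the expected value of that share is the sum of the squared marginal probabilities, which can be high for a skewed node and is below 100% even in-sample. The probabilities below are made up and do not come from the alarm data.

```r
# Hypothetical marginal distribution of one target node (not taken from alarm).
p <- c(NORMAL = 0.9, HIGH = 0.1)

# Expected share of sampled values equal to the true value when both the
# sample and the true value follow the same marginal distribution.
expected_hit_rate <- sum(p^2)

# Accuracy of always predicting the most probable level instead.
majority_accuracy <- max(p)

expected_hit_rate  # 0.82
majority_accuracy  # 0.9
```

If this reading is right, a structure-free model can already look good on skewed targets, and a sampling-based score will not reach 100% unless the model puts all probability on the true value.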
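PS1: I know the data should be split, but in my real project I don't have enough rows, and changing the size of the data set could change the final model.
PS2: The package has no multi-variable accuracy measure, so I wrote one by hand.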
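The hand-made measure boils down to the following, shown here on a toy example in base R. The data frame stands in for the samples a model would return for one test row; the values and the names samples, truth and hit_rate are invented for illustration.

```r
# Toy stand-in for the values sampled for one test row (two target nodes).
samples <- data.frame(D = c("a", "a", "b", "a"),
                      E = c("x", "y", "y", "y"),
                      stringsAsFactors = FALSE)

# True values of the target nodes in that test row.
truth <- list(D = "a", E = "y")

# Per-node hit rate: fraction of sampled values equal to the true value.
hit_rate <- sapply(names(truth), function(v) mean(samples[[v]] == truth[[v]]))

hit_rate        # D = 0.75, E = 0.75
mean(hit_rate)  # 0.75, the multi-variable score for this row
```

Averaging these per-row rates over all test rows gives the per-node accuracy, and averaging over nodes gives the total per model.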
Here is my code, with comments:
library(bnlearn)
# Dataframe
al <- alarm
# Nodes that I'm going to use as evidence in my models (all columns except 30 to 37)
nodeEvid <- names(al)[-(30:37)]
# Nodes that I'm going to use as events (targets) in my models (columns 30 to 37)
nodeEvnt <- names(al)[30:37]
## Best Model - Using all my data to create the arcs
bn_k2 <- tabu(x = al, score = 'k2')
## Worst Models
# Empty model
bn_eg <- empty.graph(names(al))
# Random model (seed set first so the same random graph is generated on every run)
set.seed(7)
bn_rd <- random.graph(names(al))
# Fitting the models ...
modelsBN <- list(bn.fit(x = bn_k2, data = al),
                 bn.fit(x = bn_eg, data = al),
                 bn.fit(x = bn_rd, data = al))
# Seed
set.seed(7)
# Randomly selecting 30% of the rows to create our dTest (these rows are also in the training data)
testRows <- sample(seq_len(nrow(al)), as.integer(nrow(al) * 0.30), replace = FALSE)
# Dataframe for test
dTest <- al[testRows, ]
# ACCURACY - CPDIST TO MULTI-VAR
# Dataframe to keep all the results in the end
accuracyCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt) + 1, nrow = length(modelsBN))), c(nodeEvnt,"TOTAL MEAN ACCURACY BY MODEL"))
# Process to calculate ...
for (m in seq_along(modelsBN)) { # For every Bayesian model m that I created
  # predCPD is a dataframe that keeps the result for each test row; more on this below
  predCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt), nrow = nrow(dTest))), nodeEvnt)
  for (i in seq_len(nrow(dTest))) { # For each row i in my dTest
    # cpdist returns a dataframe of values sampled from the model's conditional
    # distribution given the evidence: n rows, one column per node in nodeEvnt.
    # I save its result in a dataframe called 'teste'
    teste <- cpdist(modelsBN[[m]], nodes = nodeEvnt, evidence = as.list(dTest[i, names(dTest) %in% nodeEvid]), n = 1000, method = "lw")
    # For each variable in nodeEvnt, predCPD stores the fraction of rows in teste
    # (the tries from my model) that match the TRUE value
    for (j in seq_along(nodeEvnt)) { # Compute the mean hit rate for the j biotic factors (event nodes)
      predCPD[i, nodeEvnt[j]] <- sum(teste[nodeEvnt[j]] == as.character(dTest[i, nodeEvnt[j]]), na.rm = TRUE) / nrow(teste)
    }
  }
  # 'Mean of means' (each predCPD entry is already a mean) over all of dTest,
  # so accCPD holds the results for model m
  accCPD <- colMeans(predCPD, na.rm = TRUE)
  # Multiply by 100 to put the values in the 0 - 100 % format
  for (j in seq_along(nodeEvnt)) {
    accuracyCPD[m, nodeEvnt[j]] <- accCPD[nodeEvnt[j]] * 100
  }
  # Mean over my target variables, saved in "TOTAL MEAN ACCURACY BY MODEL"
  accuracyCPD[m, length(nodeEvnt) + 1] <- mean(accCPD) * 100
}
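One thing I'm not sure about in the loop above: as far as I understand, with method = "lw" cpdist attaches a weight to each sampled row (in the "weights" attribute of the returned data frame), and my hit rate ignores those weights. A minimal sketch of the difference in base R, using made-up samples and weights rather than real cpdist output:

```r
# Made-up sampled values and likelihood weights for one target node.
sampled <- c("a", "b", "a", "b")
w       <- c(0.5, 0.1, 0.3, 0.1)
truth   <- "a"

# What my loop computes: every sampled row counts the same.
unweighted <- mean(sampled == truth)

# Weighted alternative: each row counts in proportion to its weight.
weighted <- sum(w * (sampled == truth)) / sum(w)

unweighted  # 0.5
weighted    # 0.8
```

With real output the weights would come from attr(teste, "weights"); whether using them changes my results is something I have not verified.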
I'd really like to resolve this. My guess is that the answer lies in the CPTs, but that's only a guess.