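Bayesian networks: multi-variable accuracy

Time: 2018-08-22 19:58:22

Tags: r bayesian-networks bnlearn

I'm facing a problem with Bayesian networks and hope to find an answer. I'll break the problem down to make it easier to follow.

Goal: after training a model, I want to predict a set of variables Y (think of a matrix B) given a set of variables X (think of a matrix A). For example:

Given evidence on variables A, B, C, I want to predict variables D, E, F.

So, as you can see, I'm after multi-variable prediction.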
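For reference, here is a minimal sketch of this kind of prediction using bnlearn's own `predict()` with `method = "bayes-lw"`, which takes the evidence nodes through `from`. It assumes the `learning.test` data set shipped with bnlearn (its columns happen to be named A to F) and a structure learned with `hc()`; neither comes from my project, and each target is predicted separately rather than jointly.

```r
library(bnlearn)

# Assumption: bnlearn's bundled learning.test data (factors A..F) stands in
# for the real data; hc() is just one possible structure learner.
data(learning.test)
set.seed(1)
fitted <- bn.fit(hc(learning.test), learning.test)

evid    <- c("A", "B", "C")   # nodes used as evidence
targets <- c("D", "E", "F")   # nodes to predict
newdata <- learning.test[1:50, ]

# Predict each target from the evidence nodes only (likelihood weighting).
pred <- sapply(targets, function(node) {
  as.character(predict(fitted, node = node, data = newdata,
                       method = "bayes-lw", from = evid, n = 500))
})

# Share of correct hard predictions per target node.
acc <- colMeans(pred == as.matrix(newdata[, targets]))
print(acc)
```

Note that this gives the most probable value of each target on its own, which is not necessarily the most probable joint configuration of D, E and F.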

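Problem: I'm currently using the bnlearn package in R, and the problem is that my worst models (an empty graph and a random graph) sometimes get better accuracy than my best model (learned from the data), or come very close to it. Worse still, the test set is contained in my training set, so technically my best model should score ~100%, and it doesn't.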
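One reading worth checking, assuming the score computed in the code further down is the share of simulated samples that match the true value: that quantity roughly estimates the probability the model assigns to the true value, not the accuracy of a hard (most probable value) prediction. Under that assumption it stays below 100% whenever the predictive distribution is not degenerate, even on training data. A toy base-R illustration with a made-up two-state distribution `p`:

```r
# Made-up predictive distribution for one binary node (illustration only).
p <- c(a = 0.7, b = 0.3)

# If the true values themselves follow p, the average probability assigned to
# the truth is sum(p^2), while always predicting the most probable state is
# correct max(p) of the time.
prob_score <- sum(p^2)
argmax_acc <- max(p)
print(c(prob_score = prob_score, argmax_acc = argmax_acc))
```

So a "probability of the truth" score and a classification accuracy can differ noticeably for the same model, and neither has to reach 100%.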

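I'd like to know whether there is a mathematical error in my implementation (the code is explained step by step) and how to interpret this situation.

PS1: I know splitting the data is necessary, but in my real project I don't have enough rows, and changing the data set (its size) could change the final model.

PS2: The package has no multi-variable accuracy measure, so I wrote one by hand.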
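To make "multi-variable accuracy" concrete, here is a minimal base-R sketch of one possible definition: per-variable accuracy plus a joint accuracy that counts a row as correct only if every target matches. The helper name `multi_accuracy` and the toy data are hypothetical, and this is not necessarily the same measure as the one in my code.

```r
# Hypothetical helper: per-variable and joint accuracy over several targets.
# `pred` and `truth` are data frames with the same columns and row count.
multi_accuracy <- function(pred, truth) {
  stopifnot(identical(names(pred), names(truth)), nrow(pred) == nrow(truth))
  # Compare as character so that differing factor level orders do not matter.
  hits <- mapply(function(p, t) as.character(p) == as.character(t), pred, truth)
  hits <- matrix(hits, nrow = nrow(truth), dimnames = list(NULL, names(truth)))
  list(per_variable = colMeans(hits),
       joint = mean(rowSums(hits) == ncol(hits)))
}

# Toy data (made up for illustration).
truth <- data.frame(D = c("a", "b", "a"), E = c("x", "x", "y"))
pred  <- data.frame(D = c("a", "b", "b"), E = c("x", "y", "y"))
res <- multi_accuracy(pred, truth)
print(res)
```

The joint figure is usually much lower than the per-variable ones, so it matters which of the two a "total" accuracy is meant to summarize.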

Here is my code, with comments:

library(bnlearn)
# Dataframe
al <- alarm

# Nodes used as evidence in my models
nodeEvid <- names(al)[-(30:37)]

# Nodes used as events (targets) in my models
nodeEvnt <- names(al)[30:37]

## Best Model - Using all my data to create the arcs
bn_k2 <- tabu(x = al, score = 'k2')

## Worst Models
# Empty model
bn_eg <- empty.graph(names(al))

# Random model (seeded so that the random graph is reproducible)
set.seed(7)
bn_rd <- random.graph(names(al))

# Fitting the models ...
modelsBN <- list(bn.fit(x = bn_k2, data = al),
                 bn.fit(x = bn_eg, data = al),
                 bn.fit(x = bn_rd, data = al))

# Seed
set.seed(7)

# Randomly select 30% of the rows to build dTest
testRows <- sample(seq_len(nrow(al)), as.integer(nrow(al) * 0.30), replace = FALSE)

# Dataframe for test (a subset of the data the models were fitted on)
dTest <- al[testRows, ]

# ACCURACY - CPDIST TO MULTI-VAR

# Dataframe to keep all the results in the end
accuracyCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt) + 1, nrow = length(modelsBN))),
                        c(nodeEvnt, "TOTAL MEAN ACCURACY BY MODEL"))

# Compute the score for each model
for (m in seq_along(modelsBN)) { # for each fitted Bayesian network m
  # predCPD holds, for each test row and each target node, the share of
  # simulated samples that match the true value (explained further below)
  predCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt), nrow = nrow(dTest))), nodeEvnt)

  for (i in seq_len(nrow(dTest))) { # for each row i of dTest
    # cpdist() returns a data frame of samples drawn from the conditional
    # distribution given the evidence: one row per sample (n rows) and one
    # column per node in nodeEvnt. The result is stored in 'teste'.
    teste <- cpdist(modelsBN[[m]], nodes = nodeEvnt,
                    evidence = as.list(dTest[i, names(dTest) %in% nodeEvid]),
                    n = 1000, method = "lw")
    # For each node in nodeEvnt, store in predCPD the fraction of rows of
    # 'teste' whose value equals the true value in dTest
    for (j in seq_along(nodeEvnt)) { # mean hit rate for each target node j
      predCPD[i, nodeEvnt[j]] <- sum(teste[[nodeEvnt[j]]] == as.character(dTest[i, nodeEvnt[j]]), na.rm = TRUE) / nrow(teste)
    }
  }

  # 'Mean of means' over all rows of dTest (each entry of predCPD is already
  # a mean), so accCPD holds the per-node results for model m
  accCPD <- colMeans(predCPD, na.rm = TRUE)
  # Multiply by 100 to express the results on a 0 - 100 % scale
  for (j in seq_along(nodeEvnt)) {
    accuracyCPD[m, nodeEvnt[j]] <- accCPD[nodeEvnt[j]] * 100
  }
  # Mean over the target variables, saved in "TOTAL MEAN ACCURACY BY MODEL"
  accuracyCPD[m, length(nodeEvnt) + 1] <- mean(accCPD) * 100
}
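One detail that may matter here: per the bnlearn documentation, `cpdist()` with `method = "lw"` returns weighted samples, with the weights stored in the `weights` attribute of the result. The loop above counts every row equally. Below is a minimal base-R sketch of a weighted version of the same hit rate; the helper name `weighted_hit_rate` and the toy data are hypothetical, and I have not verified that this alone explains the results.

```r
# Hypothetical helper: weighted share of samples that match the true value of
# each target node. `samples` is a data frame of draws, `w` their weights,
# `truth` a one-row data frame (or list) with the true values.
weighted_hit_rate <- function(samples, w, truth) {
  stopifnot(length(w) == nrow(samples), sum(w) > 0)
  sapply(names(samples), function(node) {
    hit <- as.character(samples[[node]]) == as.character(truth[[node]])
    sum(w[hit], na.rm = TRUE) / sum(w)
  })
}

# Toy data standing in for cpdist() output (not real bnlearn results).
samples <- data.frame(D = c("a", "a", "b", "b"))
w       <- c(0.1, 0.1, 0.4, 0.4)
truth   <- data.frame(D = "b")

weighted   <- weighted_hit_rate(samples, w, truth)
unweighted <- mean(samples$D == truth$D)
print(c(weighted = unname(weighted), unweighted = unweighted))

# Inside the loop above this would be called roughly as:
#   weighted_hit_rate(teste, attr(teste, "weights"), dTest[i, nodeEvnt])
```

In an empty graph the evidence nodes have no parents, so all weights should be equal and the unweighted count would be unaffected; in a learned graph the weights generally differ, which is one candidate explanation for why the learned model does not pull ahead.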

I'd really like to get to the bottom of this. I suspect the answer lies in the CPTs, but that's only a guess.

0 answers:

No answers