Question

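I'm facing a problem with Bayesian networks and hope to find an answer. I'll try to break the problem down to make it easier to follow!

Objective: after training a model, given variables X (read: matrix A), I want to predict variables Y (read: matrix B). For example:

Given evidence on variables A, B, C, I want to predict variables D, E, F

As you can see, I'm looking for a multivariate prediction!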

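Problem (breakdown): I'm currently using the bnlearn package in R, and the problem is this: my worst models (an empty graph and a random graph) sometimes have better accuracy than my best model (learned from the data), or come very close to it! Worse still, the test set is contained in my training set, so technically my best model should score ~100%, and that does not happen!

I'd like to know whether there is a mathematical mistake in my implementation (I explain the code step by step) and how to explain this situation...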

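PS1: I know that splitting the data is necessary, but in my real project I don't have enough rows, and changing the dataset (its size) could change the final model

PS2: Multivariate accuracy doesn't exist in the package, so I built it by hand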
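
For reference, the hand-built metric computed by the code further down can be written out as follows. This is only a restatement of that code under assumed notation: $N$ is the number of rows in dTest, $J$ the number of target nodes in nodeEvnt, $n$ the number of rows returned by cpdist, $x^{(i)}_{s,j}$ the value sampled for node $j$ in sample $s$ given the evidence of row $i$, and $y_{i,j}$ the observed value. None of these symbols appear in the original post.

```latex
\begin{align}
  p_{i,j} &= \frac{1}{n} \sum_{s=1}^{n} \mathbf{1}\!\left[ x^{(i)}_{s,j} = y_{i,j} \right]
  && \text{(predCPD, one entry per test row and target node)} \\
  \mathrm{acc}_j &= \frac{100}{N} \sum_{i=1}^{N} p_{i,j}
  && \text{(per-node columns of accuracyCPD)} \\
  \mathrm{acc}_{\text{total}} &= \frac{1}{J} \sum_{j=1}^{J} \mathrm{acc}_j
  && \text{(column "TOTAL MEAN ACCURACY BY MODEL")}
\end{align}
```

Written this way, $p_{i,j}$ is the sampled frequency of the observed value, so $\mathrm{acc}_j$ is an average of such frequencies. That is not necessarily the same quantity as the share of rows for which the most frequently sampled value equals the observed one.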

Here is my code, with comments:

library(bnlearn)
# Dataframe
al <- alarm

# Nodes that I'm going to use as evidence to my models
nodeEvid <- names(al)[-c(30,31,32,33,34,35,36,37)]

# Nodes that I'm going to use as events to my models
nodeEvnt <- names(al)[c(30,31,32,33,34,35,36,37)]

## Best Model - Using all my data to create the arcs
bn_k2 <- tabu(x = al, score = 'k2')

## Worst Models
# Empty model
bn_eg <- empty.graph(names(al))

# Random model
bn_rd <- random.graph(names(al))

# Fitting the models ...
modelsBN <- list(bn.fit(x = bn_k2, data = al),
                 bn.fit(x = bn_eg, data = al),
                 bn.fit(x = bn_rd, data = al))

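# Seed (note: set after random.graph() above, so bn_rd is not covered by it)
set.seed(7)

# Randomly select 30% of the rows to build dTest
testRows <- sample(seq_len(nrow(al)), as.integer(nrow(al) * 0.30), replace = FALSE)

# Dataframe for test (a subset of the training data)
dTest <- al[testRows, ]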
# ACCURACY - CPDIST TO MULTI-VAR

# Dataframe to keep all the results in the end
accuracyCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt) + 1, nrow = length(modelsBN))), c(nodeEvnt, "TOTAL MEAN ACCURACY BY MODEL"))

# Process to calculate ...
for (m in 1:length(modelsBN)){ # For every m bayesian model that I created
  # predCPD is a dataframe generated to keep the results to each sample run, I will explain more ahead...
  predCPD <- setNames(data.frame(matrix(ncol = length(nodeEvnt), nrow = nrow(dTest))), nodeEvnt)

  for (i in seq(nrow(dTest))){ # For i samples in my dTest
    # cpdist() returns a dataframe of samples drawn from the model's conditional
    # distribution given the evidence: n rows, one column per node in nodeEvnt.
    # The result is stored in 'teste'.
    teste <- cpdist(modelsBN[[m]], nodes = nodeEvnt, evidence = as.list(dTest[i, names(dTest) %in% nodeEvid]), n = 1000, method = "lw")
    # For each node in nodeEvnt, store in predCPD the fraction of sampled rows in
    # 'teste' that match the true value observed in dTest
    for (j in seq_along(nodeEvnt)) { # Mean hit rate for the j biotic factors (target nodes)
      predCPD[i, nodeEvnt[j]] <- sum(teste[nodeEvnt[j]] == as.character(dTest[i, nodeEvnt[j]]), na.rm = TRUE) / nrow(teste)
    }
  }

  # Average the per-row fractions over dTest (a mean of means, since each predCPD
  # entry is already a mean); accCPD holds the results for model m
  accCPD <- colMeans(predCPD, na.rm = TRUE)
  # Multiply by 100 to express the values as 0 - 100 %
  for (j in seq_along(nodeEvnt)) {
    accuracyCPD[m, nodeEvnt[j]] <- accCPD[nodeEvnt[j]] * 100
  }
  # Mean over the target variables, saved in "TOTAL MEAN ACCURACY BY MODEL"
  accuracyCPD[m, length(nodeEvnt) + 1] <- mean(accCPD) * 100
}
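
One point that may be worth checking in the loop above: the bnlearn documentation describes cpdist with method = "lw" as attaching the likelihood weights to the result as an attribute named weights, while the hit rate above counts every sampled row equally. The following is a minimal base-R sketch of a weighted hit rate, not a confirmed fix. weighted_hit_rate is a hypothetical helper (it is not part of bnlearn), and the toy samples and weights are made up for illustration.

```r
# Hypothetical helper (not part of bnlearn): share of samples that match the
# observed value, with each sample optionally counted by its weight.
weighted_hit_rate <- function(samples, truth, weights = NULL) {
  # With no weights supplied, every sample counts equally
  if (is.null(weights)) weights <- rep(1, length(samples))
  hit <- as.character(samples) == as.character(truth)
  ok <- !is.na(hit)  # drop samples with a missing value
  sum(weights[ok] * hit[ok]) / sum(weights[ok])
}

# Toy data standing in for one column of a cpdist() result (made-up values)
samples <- factor(c("LOW", "LOW", "HIGH", "HIGH"))
w <- c(0.1, 0.1, 0.4, 0.4)

unweighted <- weighted_hit_rate(samples, "HIGH")     # all samples count equally
weighted   <- weighted_hit_rate(samples, "HIGH", w)  # samples count by weight
```

Assuming the weights attribute is present, the call inside the j loop would look like weighted_hit_rate(teste[[nodeEvnt[j]]], dTest[i, nodeEvnt[j]], attr(teste, "weights")). Comparing the weighted and unweighted results for the three models would show whether the weights account for part of the gap.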

I'd really like to resolve this. I think the answer lies in the CPTs, but that is only a guess...

Bayesian networks: the multivariate accuracy case

0 answers: