Question

I am using the following code:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
....
p32<-c('att35','att34','att32')

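In the real case there can be 1024 vectors. I want to find the relevant p vectors whose union covers as much of dist as possible. In this example the solution would be p1, p3, p5. I want to select the smallest number of p vectors. Also, if there is no way to cover every component of dist, I want the largest possible cover using the smallest number of vectors (p).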

N = 32
library(qdapTools)
library(dplyr)
library(data.table)
## generate matrix of attributes
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5,...,p32))

library(bigmemory)
## generate the grid of all 2^N 0/1 combinations of the p vectors
grid_matrix <- do.call(CJ, rep(list(1:0), N))  %>% as.big.matrix

Error: cannot allocate vector of size 8.0 Gb

I tried another approach:

grid_matrix <- do.call(CJ, rep(list(1:0), N))  %>% as.data.frame
grid_matrix <- as.matrix (grid_matrix)

I still get the same error.

How can I fix this and make it work for big data? I would like to continue with:

colnames(grid_matrix) <- paste0("p", 1:N)
combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% `==`(., ncol(attribute_matrix))
grid_matrix_sub <- grid_matrix[combin_all_element_present, ]
grid_matrix_sub[rowSums(grid_matrix_sub) == min(rowSums(grid_matrix_sub)), ]

Answer 1

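This is known as the set cover problem. It can be solved using integer linear programming. Let x1, x2, ... be 0/1 variables (one per p vector), and represent p1, p2, ... as 0/1 vectors P1, P2, ... and dist as a 0/1 vector D. The problem can then be stated as: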

min x1 + x2 + ... + x32
such that
P1 * x1 + P2 * x2 + ... + P32 * x32 >= D

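In R code this looks as follows. First create a list p of the p vectors in sorted order. mixedsort is used so that p32 ends up last rather than right after p3. Define attnames as the set of all att names found in any of the p vectors. Then formulate the objective function (which equals the number of p vectors in the cover), the constraint matrix (made up of the P vectors as columns) and the right-hand side of the constraints (which is dist as a 0/1 vector). Finally, run the integer linear program and convert the solution from a 0/1 vector to a vector of p names.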

library(gtools)
library(lpSolve)

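# collect p1, p2, ... into a named list, in natural (p1, p2, ..., p32) order
p <- mget(mixedsort(ls(pattern = "^p\\d+$")))
# all attribute names that occur in any p vector
attnames <- mixedsort(unique(unlist(p)))
# objective: minimize the number of selected p vectors
objective <- rep(1L, length(p))
# constraint matrix: one row per attribute, one column per p vector
const.mat <- sapply(p, function(x) attnames %in% x) + 0L
# right-hand side: 1 for attributes that must be covered (those in dist), else 0
const.rhs <- (attnames %in% dist) + 0L

# solve the 0/1 program and report the names of the selected p vectors
ans <- lp("min", objective, const.mat, ">=", const.rhs, all.bin = TRUE)
names(p)[ans$solution == 1L]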
## [1] "p2" "p4" "p5"

The constraint matrix has one row for each entry of attnames and one column for each p vector.
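To make that layout concrete, here is a small base-R sketch that rebuilds the constraint matrix for only the five example vectors p1 to p5. The names p_small, att_small, mat_small and x are illustrative and do not appear in the answer's code; p32 is left out for brevity.

```r
# Illustrative only: the five example vectors from the question
p_small <- list(
  p1 = c("att1", "att5", "att2"),
  p2 = c("att5", "att1", "att4"),
  p3 = c("att3", "att4", "att2"),
  p4 = c("att1", "att2", "att3"),
  p5 = c("att6")
)
att_small <- sort(unique(unlist(p_small)))  # att1 ... att6

# One row per attribute, one column per p vector; 1 means "this p contains this att"
mat_small <- sapply(p_small, function(v) att_small %in% v) + 0L
rownames(mat_small) <- att_small
mat_small

# A 0/1 selection x is a cover when every row of mat_small %*% x is at least 1
x <- c(p1 = 1, p2 = 0, p3 = 1, p4 = 0, p5 = 1)
all(mat_small %*% x >= 1)
```

Each row of the product counts how many selected vectors contain that attribute, which is exactly the left-hand side of the ">=" constraint.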

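This solution gives a minimum cover of those elements of dist that are in attnames. If every element of dist appears in at least one p vector, the solution is a cover of dist. If not, the solution is a cover of those names in dist that appear in one or more p vectors; so this handles both cases discussed in the question. The elements of dist that cannot be covered are: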

setdiff(dist, attnames)

so if this has length zero, the solution is a complete cover of dist. If not, the solution is a cover of:

intersect(dist, attnames)

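The sorting done in the code is not strictly necessary, but having the rows and columns of the constraint matrix in a logical order makes the various inputs to the optimization easier to work with.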
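A reported cover can be sanity-checked independently of the solver. The sketch below assumes a hypothetical helper is_cover (not part of lpSolve or of the answer's code) and re-declares the example data under the illustrative names p_list and dist_ex so that it runs on its own.

```r
# Hypothetical helper: TRUE when the selected vectors jointly contain every target element
is_cover <- function(selected, p, target) {
  all(target %in% unlist(p[selected]))
}

# Example data copied from the question (p32 omitted)
p_list <- list(
  p1 = c("att1", "att5", "att2"),
  p2 = c("att5", "att1", "att4"),
  p3 = c("att3", "att4", "att2"),
  p4 = c("att1", "att2", "att3"),
  p5 = c("att6")
)
dist_ex <- paste0("att", 1:6)

# Only elements that occur in at least one p vector can be covered at all
coverable <- intersect(dist_ex, unique(unlist(p_list)))

is_cover(c("p2", "p4", "p5"), p_list, coverable)  # the selection reported above
is_cover(c("p1", "p2"), p_list, coverable)        # misses att3 and att6
```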

Note: run this code from the question before running the code above:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
p32<-c('att35','att34','att32')

Answer 2

The answer already given is perfect, but an alternative approach could be the following:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')


library(qdapTools)
library(data.table)
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5))


minimal_sets <- function(superset, subsets_matrix, p){

  # work on a data.table copy so the caller's object is not modified by reference
  subsets_matrix <- as.data.table(subsets_matrix)
  # removing the columns that are not in the superset
  updated_sub_matr <- subsets_matrix[, which(names(subsets_matrix) %in% superset), with = FALSE]

  # initializing counter for iterations and the subset selected
  subset_selected <- integer(0)
  counter <- p

  ## Loop until we run out of iterations (counter = 0), the superset is covered,
  ## or there are no columns left that could be covered
  while (counter > 0 && length(superset) > 0 && ncol(updated_sub_matr) > 0){

    ## find the row with the most matches with the superset we want to achieve
    max_index <- which.max(rowSums(updated_sub_matr))

    ## names of the superset entries covered by that row; stop if it covers nothing
    covered <- names(updated_sub_matr)[unlist(updated_sub_matr[max_index, ]) != 0]
    if (length(covered) == 0) break

    ## remove the covered entries from the superset, and drop those columns
    ## from the matrix as they don't contribute anymore
    superset <- setdiff(superset, covered)
    updated_sub_matr <- updated_sub_matr[, setdiff(names(updated_sub_matr), covered), with = FALSE]

    counter <- counter - 1
    subset_selected <- c(subset_selected, max_index)
  }

  if (length(superset) > 0){
    print(paste0("No solution found, there are(is) ", length(superset), " element(s) left ", paste(superset, collapse = "-")))
  } else {
    print(paste0("Found a solution after ", p - counter, " iterations"))
  }

  print(paste0("Selected the following subsets: ", paste(subset_selected, collapse = "-")))

}

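This function takes your superset (dist in this example), the attribute_matrix you want to check, and a number p (the maximum number of iterations, i.e. of subsets to select). It prints the best solution it found and the number of iterations used.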

> minimal_sets(dist, attribute_matrix, 1)
[1] "No solution found, there are(is) 3 element(s) left att3-att4-att6"
[1] "Selected the following subsets: 1"

> minimal_sets(dist, attribute_matrix, 3)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"

> minimal_sets(dist, attribute_matrix, 5)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"
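The function above selects, at each step, the row that covers the most remaining elements, which is a greedy strategy. A greedy choice is a heuristic and is not guaranteed to give the smallest cover in general. As a rough sketch of the same idea in base R, the hypothetical helper greedy_cover below (not part of the original answer) works on a named list and returns its result instead of printing it; p_list and dist_ex are illustrative copies of the example data.

```r
# Hypothetical helper: greedy cover on a named list of character vectors
greedy_cover <- function(target, sets, max_sets = length(sets)) {
  chosen <- character(0)
  # only elements that occur in some set can ever be covered
  remaining <- target[target %in% unlist(sets)]
  while (length(remaining) > 0 && length(chosen) < max_sets) {
    # how many still-uncovered elements each set would add
    gain <- vapply(sets, function(s) sum(remaining %in% s), numeric(1))
    best <- names(sets)[which.max(gain)]
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, sets[[best]])
  }
  list(chosen = chosen, uncovered = setdiff(target, unlist(sets[chosen])))
}

p_list <- list(
  p1 = c("att1", "att5", "att2"),
  p2 = c("att5", "att1", "att4"),
  p3 = c("att3", "att4", "att2"),
  p4 = c("att1", "att2", "att3"),
  p5 = c("att6")
)
dist_ex <- paste0("att", 1:6)

res <- greedy_cover(dist_ex, p_list)
res$chosen      # names of the selected vectors
res$uncovered   # elements of dist_ex that were not covered
```

Returning a list makes it straightforward to compare the greedy selection with the one found by the integer program, or to limit the number of picks through max_sets.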

Minimum number of covering vectors for a large list
