Question

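Suppose I have a categorical vector, and I want to write a function that returns every possible way of grouping its categories into subsets, so that I can recode the vector accordingly.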

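Taking 4 distinct categories (1 | 2 | 3 | 4) as an example, applying the function should generate 15 groupings with varying numbers of new categories (below, | separates the new categories, while , separates the old categories within a new category):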

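1 new category: (1,2,3,4)
2 new categories: (1 | 2,3,4)
2 new categories: (2 | 1,3,4)
2 new categories: (3 | 1,2,4)
2 new categories: (4 | 1,2,3)
2 new categories: (1,2 | 3,4)
2 new categories: (1,3 | 2,4)
2 new categories: (1,4 | 2,3)
3 new categories: (1 | 2 | 3,4)
3 new categories: (1 | 3 | 2,4)
3 new categories: (1 | 4 | 2,3)
3 new categories: (2 | 3 | 1,4)
3 new categories: (2 | 4 | 1,3)
3 new categories: (3 | 4 | 1,2)
4 new categories: (1 | 2 | 3 | 4)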
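
For reference, the 15 groupings listed above are the set partitions of four items, and their count is the Bell number B(4) = 15. The following is only a minimal sketch of one way to enumerate them recursively, working on indices 1..n rather than on the category values themselves. The name set_partitions is a hypothetical helper introduced here for illustration, and the sketch has not been benchmarked.

```r
# Hypothetical helper (an illustrative assumption, not a library function):
# enumerate all set partitions of the indices 1..n.
# Returns a list; each element is one partition, itself a list of integer vectors.
set_partitions <- function(n) {
  n <- as.integer(n)
  # Base case: the empty set has exactly one partition (the empty one)
  if (n == 0L) return(list(list()))
  smaller <- set_partitions(n - 1L)
  out <- list()
  for (p in smaller) {
    # Option 1: add index n to each existing block in turn
    for (i in seq_along(p)) {
      q <- p
      q[[i]] <- c(q[[i]], n)
      out[[length(out) + 1L]] <- q
    }
    # Option 2: let index n form a new block of its own
    out[[length(out) + 1L]] <- c(p, list(n))
  }
  out
}
```

Under these assumptions, length(set_partitions(4)) is 15. Each partition of n items is built exactly once from a partition of n - 1 items, so no invalid candidates have to be generated and filtered out afterwards.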

I managed to write a function that does the job. (Note: the function does not work directly on the categories of vec; instead, it generates and works on their indices. The returned object combins is a nested list whose length equals the number of possible combinations. Each element of combins is a list containing the index groupings of the old categories, with each group represented by an integer vector.)

choose_cat <- function(vec) {
  cats <- unique(vec[!is.na(vec)]) # categories of the vector excluding NA
  num_cat <- length(cats) # number of categories
  splits <- unlist(lapply(seq_len(num_cat), function(x) { # split the categories into all possible pieces of various lengths
    combn(num_cat, x, simplify = FALSE)
  }), recursive = FALSE)
  combins <- lapply(seq_len(num_cat), function(y) { # choose from the pieces to constitute new combinations
    combn(splits, y, function(z) {
      set <- unlist(z)
      # remove invalid combinations (those with duplicated categories or those missing any categories);
      # compare against the indices, since splits holds indices rather than category values
      if (!any(duplicated(set)) && setequal(set, seq_len(num_cat))) {
        z
      } else {
        NULL
      }
    }, simplify = FALSE)
  })
  combins <- unlist(combins, recursive = FALSE)
  combins <- combins[!vapply(combins, is.null, logical(1))]
  combins
}

It works fine when the number of categories is 4 or fewer, but becomes very slow beyond that:

> system.time(choose_cat(1:4))
   user  system elapsed
   0.03    0.00    0.03
> system.time(choose_cat(1:5))
   user  system elapsed
  53.94    0.00   53.93

I tried using parLapply() from the parallel package to replace the second lapply call in the function, but that only made things worse.

Does anyone have a more efficient algorithm, along with an R implementation, that would speed this up?

Developing a faster algorithm for this computation in R

0 answers: