Question

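I have a program that, for a chosen number of iterations, randomly selects N elements of LETTERS (without replacement) and combines all iterations into one master df. I added a "uniqueness" algorithm to the program that checks how many elements of LETTERS in the current iteration differ from each of the previous iterations. Basically, I want every run to be "different" from every other run.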

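For example, if the current run selects c(A, J, C, Y, W) and a previous run was c(K, M, Z, A, I), then the number of different letters is 4, because "A" appears in both. If 4 is at least the "uniqueness threshold", the run is added to the master df; otherwise we skip to the next iteration.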
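A minimal sketch of that count, using base R's setdiff (the variable names here are only illustrative and do not appear in my program):

```r
current  <- c("A", "J", "C", "Y", "W")
previous <- c("K", "M", "Z", "A", "I")

# Letters of the current run that do not appear in the previous run
n_different <- length(setdiff(current, previous))
n_different  # 4, because only "A" is shared
```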

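To be clear, the code below does work; it just becomes very slow for a large number of iterations. The most obvious reason is that for each iteration i in 1:n, the current i has to be checked against up to i-1 previous iterations. As i grows, each iteration takes longer.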

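Looking at my reproducible code, can anyone suggest how to speed this up? Is there a strategy for measuring "uniqueness" that does not involve checking every previous run? Thanks for your help.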

library(dplyr)

df <- data.frame()
Run <- 100 # number of iterations
numProducts <- 5 # number of LETTERS to choose at random for each run
UniqueThresh <- 2 # i.e. need to have at least 2 different than any other
for (i in 1:Run) {
  # Make random "Product List", put into temp df with Run ID
  products <- sample(LETTERS, numProducts, replace = F)
  dfTemp <- data.frame(RunID = rep(i, numProducts), products)

  # Test uniqueness (pseudo code):
  #   Get all ID's currently in `df`. For those runIDs:
  #       Count how many LETTERS in dfTemp are in run[i] (`NonUnique`).
  #         if Unique LETTERS >= UniqueThresh  THEN rbind ELSE break unique-test and go to next i
  if (i > 1) {
    flag <- TRUE
    RunIDList <- distinct(df, RunID) %>% pull()
    for (runi in RunIDList) {
      # Filter main df on current `runi`
      dfUniquei <- df %>% filter(RunID == runi)
      # Count how many in products in current `i` are in df[runi]
      NonUnique <- sum(dfTemp$products %in% dfUniquei$products)
      TotalUnique <- numProducts - NonUnique

      # If the number of unique products is below the threshold, flag as bad and break out of the runi for-loop to jump to next i
      if (TotalUnique < UniqueThresh) {
        flag <- FALSE
        break
      }
    }
    # If "not unique enough" then don't add to main `df` and skip to next run
    if(!flag) next
  }

  df <- rbind(df, dfTemp)
}

Answer 1

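Instead of looping over every ID in the data frame, I used group_by and summarise to compare the current "product list" with each past ID. For each past ID, combine its letters with the current letters: if the number of unique letters in the combined list is greater than numProducts+UniqueThresh-1, then the current list has at least UniqueThresh (2 in this case) letters that differ from that particular ID.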
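To see why the threshold works, here is a small worked check on the example from the question (a sketch; past, current, union_size and the other names are illustrative only). Since both lists have numProducts distinct letters, the size of the combined list equals numProducts plus the number of differing letters.

```r
numProducts  <- 5
UniqueThresh <- 2

past    <- c("K", "M", "Z", "A", "I")
current <- c("A", "J", "C", "Y", "W")

# Unique letters across both lists: 5 from `past` plus 4 new ones
union_size  <- length(unique(c(past, current)))
n_different <- union_size - numProducts

# Same condition as in the summarise() call
passes <- union_size > (numProducts + UniqueThresh - 1)
```

Here union_size is 9 and n_different is 4, so the condition 9 > 6 holds and the list would be accepted against this ID.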

library(dplyr)

Run <- 100 # number of iterations
numProducts <- 5 # number of LETTERS to choose at random for each run
UniqueThresh <- 2 # i.e. need to have at least 2 different than any other

#initialize: the first set will automatically be accepted.
df <- data.frame(RunID = rep(1, numProducts), prods = sample(LETTERS, numProducts, replace = F))

for (i in 2:Run) {
  # Make random "Product List"
  products <- sample(LETTERS, numProducts, replace = F)
  # Test uniqueness:
  # If "not unique enough" then don't add to main `df` and skip to next run
  is_unique_enough <- df %>%
    group_by(RunID) %>%
    summarise(test = length(unique(c(as.character(prods), products))) >
                (numProducts + UniqueThresh - 1)) %>%
    pull(test) %>%
    all()
  if (is_unique_enough) {
    df <- rbind(df, data.frame(RunID = rep(i, numProducts), prods = products))
  }
}
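
If this is still too slow, one possible alternative (a sketch only, not benchmarked, and not the approach above) is to store each accepted run as a 0/1 row over LETTERS and count shared letters with a single matrix product. It still compares against every accepted run, but it does so in one vectorised step and avoids growing df with rbind inside the loop. The names accepted, accepted_ids and n_accepted are assumptions introduced for this illustration, and set.seed is only there to make the sketch reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility of this sketch

Run <- 100          # number of iterations
numProducts <- 5    # number of LETTERS to choose at random for each run
UniqueThresh <- 2   # need at least 2 letters different from any other run

# One row per accepted run, one column per letter; 1 marks a chosen letter
accepted <- matrix(0, nrow = Run, ncol = length(LETTERS),
                   dimnames = list(NULL, LETTERS))
accepted_ids <- integer(Run)
n_accepted <- 0

for (i in seq_len(Run)) {
  products <- sample(LETTERS, numProducts, replace = FALSE)
  current <- as.numeric(LETTERS %in% products)

  if (n_accepted > 0) {
    # Number of letters shared with each accepted run, in one matrix product
    shared <- accepted[seq_len(n_accepted), , drop = FALSE] %*% current
    # Reject if any accepted run leaves fewer than UniqueThresh different letters
    if (any(numProducts - shared < UniqueThresh)) next
  }

  n_accepted <- n_accepted + 1
  accepted[n_accepted, ] <- current
  accepted_ids[n_accepted] <- i
}

# Rebuild the long-format data frame once, after the loop
keep <- seq_len(n_accepted)
df <- data.frame(
  RunID = rep(accepted_ids[keep], each = numProducts),
  products = unlist(lapply(keep, function(k) LETTERS[accepted[k, ] == 1]))
)
```

Note that the letters within each run come out in alphabetical order rather than in the order they were sampled, which does not matter for the uniqueness test.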
