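I have a program that, for a chosen number of iterations, randomly selects N elements of LETTERS (without replacement) and combines all iterations into one master df. I added a "uniqueness" algorithm to the program that checks how many elements of LETTERS in the current iteration differ from those in each of the previous iterations. Basically, I want every run to be "different" from the other runs.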
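For example, if the current run selects c(A, J, C, Y, W) and a previous run selected c(K, M, Z, A, I), the number of different letters is 4, because "A" appears in both. If 4 is at least the "uniqueness threshold", the run is added to the master df; otherwise the program skips to the next iteration.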
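The count in this example can be reproduced with base R's setdiff. This is only a minimal sketch; the names `current`, `previous`, `different`, `n_different` and `n_shared` are illustrative and do not appear in the original program.

```r
# Letters from the example above (variable names are illustrative)
current  <- c("A", "J", "C", "Y", "W")
previous <- c("K", "M", "Z", "A", "I")

# Letters of the current run that do not occur in the previous run
different   <- setdiff(current, previous)
n_different <- length(different)

# The same count obtained from the overlap, which is how the loop
# in the question computes it (numProducts - NonUnique)
n_shared <- sum(current %in% previous)
stopifnot(n_different == length(current) - n_shared)

print(different)
print(n_different)
```

Only "A" is shared, so n_different is 4 here.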
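To be clear, the code below does work; it just becomes very slow for a large number of iterations. The most obvious reason is that, for each iteration i in 1:n, the current i has to be checked against up to i-1 previous iterations. As i grows, each iteration takes longer.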
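To make that growth concrete, the sketch below counts the worst-case number of run-to-run comparisons, assuming every run is accepted so that run i really is checked against all i - 1 earlier runs. `worst_case_comparisons` is a hypothetical helper, not part of the original program, and the real count is lower whenever runs are rejected or the inner loop breaks early.

```r
# Upper bound on pairwise checks: run i is compared with i - 1 earlier runs,
# so the total is 0 + 1 + ... + (n - 1) = n * (n - 1) / 2
worst_case_comparisons <- function(n) {
  sum(seq_len(n) - 1)
}

# Agrees with the closed form
stopifnot(worst_case_comparisons(100) == 100 * 99 / 2)

# Ten times more runs means roughly a hundred times more comparisons
counts <- sapply(c(100, 1000, 10000), worst_case_comparisons)
print(counts)
```

In the posted code each of those comparisons also filters the whole df, so the cost per comparison grows with the number of stored runs as well.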
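Looking at my reproducible code, can anyone suggest how to speed this up? Is there a strategy for measuring "uniqueness" that does not involve checking every previous run? Thanks for your help.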
library(dplyr)

df <- data.frame()
Run <- 100          # number of iterations
numProducts <- 5    # number of LETTERS to choose at random for each run
UniqueThresh <- 2   # i.e. need at least 2 letters different from any other run

for (i in 1:Run) {
  # Make random "Product List", put into temp df with Run ID
  products <- sample(LETTERS, numProducts, replace = FALSE)
  dfTemp <- data.frame(RunID = rep(i, numProducts), products)

  # Test uniqueness:
  #   For every RunID already in `df`, count how many letters in dfTemp
  #   also occur in that run (`NonUnique`).
  #   If the number of different letters >= UniqueThresh for every run, rbind;
  #   otherwise stop testing and go to the next i.
  if (i > 1) {
    flag <- TRUE
    RunIDList <- distinct(df, RunID) %>% pull()
    for (runi in RunIDList) {
      # Filter main df on current `runi`
      dfUniquei <- df %>% filter(RunID == runi)
      # Count how many products in current `i` are in df[runi]
      NonUnique <- sum(dfTemp$products %in% dfUniquei$products)
      TotalUnique <- numProducts - NonUnique
      # If the number of different products is below the threshold, flag as bad
      # and break out of the `runi` loop to jump to the next i
      if (TotalUnique < UniqueThresh) {
        flag <- FALSE
        break
      }
    }
    # If "not unique enough" then don't add to main `df` and skip to next run
    if (!flag) next
  }
  df <- rbind(df, dfTemp)
}
Answer 0 (score: 3)
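Instead of looping over every ID in the data frame, I used group_by and summarise to compare the current "product list" with every past ID. If the number of unique letters in the combined list is greater than numProducts+UniqueThresh-1, we can conclude that the current list has at least 2 (in this case) letters that differ from that particular ID.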
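The reasoning behind that threshold can be checked on a single pair of lists. If d letters of the new list are absent from a stored list of numProducts distinct letters, the combined list has numProducts + d unique letters, so the union test is equivalent to d >= UniqueThresh. A small sketch, with `old`, `new`, `d` and `union_size` as illustrative names:

```r
numProducts  <- 5
UniqueThresh <- 2

old <- c("K", "M", "Z", "A", "I")  # a stored run
new <- c("A", "J", "C", "Y", "W")  # the candidate run

# Letters in `new` that `old` lacks
d <- length(setdiff(new, old))

# Size of the combined list after removing duplicates
union_size <- length(unique(c(old, new)))
stopifnot(union_size == numProducts + d)

# The answer's test and the direct test agree
passes_union_test  <- union_size > numProducts + UniqueThresh - 1
passes_direct_test <- d >= UniqueThresh
stopifnot(identical(passes_union_test, passes_direct_test))
```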
library(dplyr)

Run <- 100          # number of iterations
numProducts <- 5    # number of LETTERS to choose at random for each run
UniqueThresh <- 2   # i.e. need at least 2 letters different from any other run

# Initialize: the first set is accepted automatically.
df <- data.frame(RunID = rep(1, numProducts),
                 prods = sample(LETTERS, numProducts, replace = FALSE))

for (i in 2:Run) {
  # Make random "Product List"
  products <- sample(LETTERS, numProducts, replace = FALSE)

  # Test uniqueness against every stored RunID in one grouped summary
  unique_enough <- df %>%
    group_by(RunID) %>%
    summarise(test = length(unique(c(as.character(prods), products))) >
                (numProducts + UniqueThresh - 1)) %>%
    pull(test) %>%
    all()

  # If "not unique enough" then don't add to main `df` and skip to next run
  if (unique_enough) {
    df <- rbind(df, data.frame(RunID = rep(i, numProducts), prods = products))
  }
}
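A further variation on comparing against all past runs at once, which is not the method used in the answer above, is to store accepted runs as rows of a 0/1 indicator matrix over LETTERS. The overlap with every stored run is then a single matrix-vector product. The following is only a sketch under that assumption: `accepted`, `run_ids`, `n_accepted`, `indicator` and `shared` are hypothetical names, and the seed is arbitrary and only there to make the run repeatable.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

Run <- 100
numProducts <- 5
UniqueThresh <- 2

# One row per accepted run, one column per letter;
# preallocated so nothing is grown with rbind inside the loop
accepted <- matrix(0L, nrow = Run, ncol = length(LETTERS),
                   dimnames = list(NULL, LETTERS))
run_ids <- integer(Run)
n_accepted <- 0L

for (i in seq_len(Run)) {
  products  <- sample(LETTERS, numProducts, replace = FALSE)
  indicator <- as.integer(LETTERS %in% products)

  if (n_accepted > 0L) {
    # Number of letters shared between the candidate and each accepted run
    shared <- accepted[seq_len(n_accepted), , drop = FALSE] %*% indicator
    # Reject if any accepted run differs by fewer than UniqueThresh letters
    if (any(numProducts - shared < UniqueThresh)) next
  }

  n_accepted <- n_accepted + 1L
  accepted[n_accepted, ] <- indicator
  run_ids[n_accepted] <- i
}

# Drop the unused preallocated rows
accepted <- accepted[seq_len(n_accepted), , drop = FALSE]
run_ids  <- run_ids[seq_len(n_accepted)]

# Long format comparable to the `df` in the question
# (letters come out in alphabetical order, not in sampling order)
df <- data.frame(
  RunID = rep(run_ids, each = numProducts),
  products = unlist(lapply(seq_len(n_accepted),
                           function(k) LETTERS[accepted[k, ] == 1L]))
)
```

Each candidate is still compared with every accepted run, so the number of comparisons is unchanged; the sketch only replaces the per-run filtering and the repeated rbind with one vectorized operation.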