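I know the title is a bit vague. Please read on for details.
I have a large number of known collections (say 10,000) of variable length, each one drawn from the English alphabet. They look like this: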
a = ['a', 'b', 'c', 'a']
b = ['c', 'd', 'a', 'b']
c = ['x', 'y', 'z']
....
unique_value = set((*a, *b, *c, ...))
# {'a', 'b', 'c', 'd', 'e', 'f', ..., 'u', 'v', 'w', 'x', 'y', 'z'}
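I need to pick a fixed number of these collections (say 100) out of the 10,000+, such that the selection contains every English letter and the count of each letter is as balanced as possible. Balanced means the letters are evenly distributed. I know a perfectly even distribution is hard to achieve, so defining the balance criterion is also important.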
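One possible way to make "balanced" concrete, offered only as a sketch and not as a settled definition: count how often each letter occurs across the chosen collections and take the population standard deviation of those counts, where 0 would mean a perfectly even distribution. The names `balance_score` and `alphabet` are assumptions introduced here for illustration.

```python
from collections import Counter
from statistics import pstdev
import string


def balance_score(chosen, alphabet=string.ascii_lowercase):
    """Return (covers_all, spread) for a list of chosen collections.

    covers_all: True if every letter of `alphabet` occurs at least once.
    spread: population standard deviation of the per-letter counts
            (lower means more evenly distributed).
    """
    counts = Counter()
    for collection in chosen:
        counts.update(collection)  # duplicates inside a collection are counted
    per_letter = [counts[ch] for ch in alphabet]
    covers_all = all(n > 0 for n in per_letter)
    return covers_all, pstdev(per_letter)


# Tiny illustration over a two-letter alphabet: 'a' occurs once, 'b' twice.
covered, spread = balance_score([['a', 'b'], ['b']], alphabet='ab')
print(covered, spread)
```

Other criteria (for example the gap between the most and least frequent letter) would plug into the same shape; which one fits best depends on how strictly "even" needs to be read.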
Please suggest an approach for achieving this. Any suggestions would be greatly appreciated.
Thanks in advance!
Answer 0 (score: 1)
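The general algorithm I would try is a probabilistic one. I would build a reverse lookup table from characters to subset_ids, then keep adding and removing subsets so that the selection hovers at +0/+1 around the fixed number of subsets. When adding, I would add a randomly chosen subset that contains the least-populated letter; when removing, I would pick one of the subsets that contain the most-populated letter. There should also be a small chance to "mutate" and pick a completely random subset to add or remove, to avoid getting stuck in a local minimum.
I tried to code this up, but it quickly degraded into spaghetti code as I fixed corner cases and bugs. It is far from a perfect solution and may even return a wrong answer, but it should at least give you some ideas.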
from collections import defaultdict
from random import choice, random
from statistics import pstdev

# `subsets` is the list of input collections from the question, e.g. [a, b, c, ...]
# Make lookup table: character -> ids of all subsets containing it
lookup = defaultdict(set)
for idx, subset in enumerate(subsets):
    for character in subset:
        lookup[character].add(idx)

best_score, best_subsets = float('inf'), None
size = 10  # number of subsets to pick
subset_indices = set()  # ids of the currently selected subsets
character_subsets = defaultdict(set)  # selected subset ids per letter

# loop some large number of times
for _ in range(10000):
    if len(subset_indices) > size:  # one too many: remove a subset
        idx = choice(list(subset_indices))  # mutation: a completely random one
        if random() < 0.9:  # 90% chance: remove a subset holding the most populated letter
            indices = max(character_subsets.values(), key=len)
            if indices:
                idx = choice(list(indices))
        for character in set(subsets[idx]):  # set() guards against repeated letters
            character_subsets[character].discard(idx)
        subset_indices.remove(idx)
    else:  # add a new subset
        unused = list(set(range(len(subsets))) - subset_indices)
        if not unused: continue  # every subset is already selected
        idx = choice(unused)  # mutation: a completely random unused subset
        if random() < 0.9:  # 90% chance: add a subset holding the least populated letter
            char = min(lookup, key=lambda c: len(character_subsets[c]))
            indices = lookup[char] - character_subsets[char]  # not yet selected
            if not indices: continue  # abort if empty
            idx = choice(list(indices))
        for character in set(subsets[idx]):
            character_subsets[character].add(idx)
        subset_indices.add(idx)

    # measure distribution over every known letter, including those at zero
    score = pstdev([len(character_subsets[c]) for c in lookup])
    if score < best_score and len(subset_indices) == size:  # if better
        # copy the inner sets, since they keep changing in later iterations
        best_subsets = {c: set(ids) for c, ids in character_subsets.items()}
        best_score = score

# do logic to pretty-print or process best_subsets however you like
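For that last step, here is a minimal sketch of one way to summarize the result, assuming `best_subsets` is a mapping from each letter to the set of selected subset ids, as recorded in the loop above. The helper name `summarize` and the `example` data are hypothetical, made up purely to show the shape.

```python
def summarize(best_subsets):
    """Return (sorted chosen subset ids, {letter: count}) from a letter -> ids mapping."""
    # The chosen ids are the union of the ids recorded under every letter.
    chosen = sorted(set().union(*best_subsets.values()))
    # The count of a letter is the number of chosen subsets that contain it.
    counts = {letter: len(ids) for letter, ids in sorted(best_subsets.items())}
    return chosen, counts


# Hypothetical result for illustration only.
example = {'a': {0, 2}, 'b': {0}, 'c': {2, 5}}
chosen, counts = summarize(example)
for letter, n in counts.items():
    print(f"{letter}: {'#' * n} ({n})")  # crude text histogram per letter
print("chosen subset ids:", chosen)
```

Note that a subset containing none of the tracked letters would not show up in the recovered ids, so storing the selected ids directly alongside the score may be the safer choice.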