I have a list of ~100,000 unique IDs and I need to distribute them across 3 lists so that each list gets ~33,000 of them.
The tricky part is that each list has ~20k unique IDs it cannot use: an exclusion list. The 3 exclusion lists overlap each other by 15%-50% and vary in size, but in the end, after the exclusions, there are always enough IDs left in the original list for each of the three lists to get its ~33% share.
biglist = [] #100k elements
a_exc = [] #15k elements in common w/biglist
b_exc = [] #25k elements in common w/biglist
c_exc = [] #30k elements in common w/biglist
# function to distribute biglist into a_list, b_list, and c_list
# such that no element in a_list is in a_exc, etc.,
# all elements of biglist are distributed (unless an element is in all 3 exc lists),
# and a/b/c are equal in size, or as close to equal as possible
Because the exclusion lists overlap, it isn't as simple as dealing the IDs out to each list in turn. For what it's worth, I have a bunch of these distributions to solve, and I need to run through them iteratively. In some cases the exclusion lists are each ~50% of the size of the big list, and they can overlap each other by up to ~50%.
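As a quick sanity check on the "enough IDs remain" claim, here is a minimal sketch (the helper name count_fully_excluded is my own): an ID that appears in all three exclusion lists can never be placed anywhere, so a count of zero is a necessary, though not sufficient, condition for a complete distribution.

def count_fully_excluded(biglist, exclusions):
    # an ID present in every exclusion list has no legal destination;
    # zero such IDs is necessary (not sufficient) for placing everything
    blocked = set(biglist)
    for exc in exclusions:
        blocked &= set(exc)
    return len(blocked)

# count_fully_excluded(biglist, [a_exc, b_exc, c_exc]) == 0 means no ID is
# ruled out everywhere; the size caps can still force some IDs to be dropped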
Here is some test code that shows the problem at 1/10th scale (the full 100k/30k sizes take a while on my CPU). When I run this, the three lists always come out at ~3333, 3333, 2450, which is similar to the spread I get when running against the larger lists.
import random

def lst_maker(num):
    # build a list of num unique random 10-digit IDs
    l = []
    for i in range(num):
        a = random.randint(1000000000, 9999999999)
        while a in l:
            a = random.randint(1000000000, 9999999999)
        l.append(a)
    return l

def exc_maker(inl, num):
    # pick num distinct elements of inl to act as an exclusion list
    l = []
    for i in range(num):
        a = random.choice(inl)
        while a in l:
            a = random.choice(inl)
        l.append(a)
    return l
biglist = lst_maker(10000)
a_exc = exc_maker(biglist, 3000)
b_exc = exc_maker(biglist, 3000)
c_exc = exc_maker(biglist, 3000)
def distribute_3(lst):
    # dedupe and shuffle, then greedily deal the elements into a, b and c
    lst = list(set(lst))
    ll = len(lst) // 3
    random.shuffle(lst)
    a = []
    b = []
    c = []
    for e in lst:
        if e not in a_exc and len(a) < ll:
            a.append(e)
        elif e not in b_exc and len(b) < ll:
            b.append(e)
        elif e not in c_exc and len(c) < ll:
            c.append(e)
        # an element that fits nowhere is silently dropped, which is
        # why the third list comes up short
    return a, b, c
a_list, b_list, c_list = distribute_3(biglist)
print(len(a_list), len(b_list), len(c_list))
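Independent of the short third list, much of the slowness at full scale comes from the e not in a_exc tests, which scan a plain Python list on every lookup. A small sketch of the obvious fix (the *_set names are mine): convert each exclusion list to a set once, then test membership against the sets inside distribute_3.

a_exc_set = set(a_exc)  # O(1) membership tests instead of scanning the list
b_exc_set = set(b_exc)
c_exc_set = set(c_exc)
# then inside distribute_3:
#     if e not in a_exc_set and len(a) < ll:
#         ...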
Answer (score: 1):
Besides spreading the items over the lists, the problem has three main complications: the exclusion lists overlap each other, the result lists have to come out (nearly) equal in size, and a greedy assignment can paint itself into a corner, leaving some item with no list that can still take it.
So, a workable solution is to try adding each item to the first list that has room for it, and when you get stuck, backtrack to the previous addition and retry it elsewhere; if that in turn gets stuck, undo the addition before it, and so on. (A tiny example of getting stuck: with two lists of capacity 1 and an item that only fits in the first list, greedily putting an unconstrained item into the first list dead-ends, and only undoing that placement fixes it.)
In a functional language this type of backtracking lends itself to a recursive implementation, but since Python's maximum recursion depth is quite limited, an iterative approach is likely better, especially considering the size of your dataset.
Here's my solution:
# generate list of identifiers
biglist = list(range(20))
# arbitrary exclusions, with some duplication
a_exc = [0, 2, 8, 15]
b_exc = [1, 3, 4, 6, 12]
c_exc = [0, 1, 5, 6, 7, 9, 1, 0]
def distribute(xs, n, exclusions):
    # will distribute the contents of list xs over n lists, excluding items from exclusions[m] for the m-th list
    # returns a list of sets (destructive, so xs will be empty after execution, pass in a copy to avoid)
    # initialise result lists
    result = [set() for _ in range(n)]
    # calculate maximum size for each of the lists for a balanced distribution
    result_size = len(xs) // n
    if len(xs) % n > 0:
        result_size += 1
    # initialise a list of additions, to allow for backtracking; recursion would be cleaner,
    # but your dataset is too large and Python is not a functional language that is optimised for this
    additions = []
    # add all xs to the lists, trying the lists in order, backtracking if lists fill up
    while xs:
        # get the last element from the list
        x = xs.pop()
        # find a place to add it, starting at the first list
        i = 0
        while True:
            while i < n:
                # find a list that's not full and can take x
                if len(result[i]) < result_size and x not in exclusions[i]:
                    # add it
                    result[i].add(x)
                    # remember this exact addition
                    additions.append((i, x))
                    break
                i += 1
            # if x could not be added (due to exclusions and full lists)
            if i == n:
                # put the current x back at the end of the list
                xs.append(x)
                # go back to the previous x
                i, x = additions.pop(-1)
                # take it out of the list it was put into
                result[i].remove(x)
                # try putting it in the next list available
                i += 1
            else:
                break
    return result
spread_lists = distribute(biglist, 3, [a_exc, b_exc, c_exc])
print(spread_lists)
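A quick sanity check one might add here (my addition, not part of the original answer): every identifier should be placed exactly once, and no result set should contain one of its excluded identifiers.

# spread_lists holds sets, so set operations apply directly
for res, exc in zip(spread_lists, [a_exc, b_exc, c_exc]):
    assert not res & set(exc), "an excluded identifier was placed"
# distribute is destructive, so compare against range(20) rather than biglist
assert set().union(*spread_lists) == set(range(20)), "an identifier went missing"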
There is still room for optimisation, but I do believe this works.
Actually, after generating some larger test sets, I found that the algorithm needs an optimisation, and it's actually quite simple: sort the input by the number of exclusion lists each identifier matches. That way, identifiers that are excluded n times are processed first, then identifiers that are excluded n-1 times, and so on.
This adds the following line to the start of distribute:
# sort the input by most exclusions, most exclusions last, as list is processed in reverse order
xs = [x for _, x in sorted([([x in exc for exc in exclusions].count(True), x) for x in xs])]
As a side benefit, this also means distribute no longer empties xs (the list comprehension rebinds xs to a new list), in case that destructive behaviour was unwanted.
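To illustrate what the sort key computes for the small example above (my illustration, not part of the answer): identifiers that appear in more exclusion lists end up at the back of xs, and since xs.pop() takes from the back, the most constrained identifiers are handled first, while there is still the most room to manoeuvre.

# exclusion counts for the toy data; identifiers 0, 1 and 6 each appear in
# two exclusion lists, so the sort moves them to the end of xs, where
# xs.pop() picks them up first
counts = {x: sum(x in exc for exc in [a_exc, b_exc, c_exc]) for x in range(20)}
print(counts[0], counts[1], counts[6], counts[10])  # -> 2 2 2 0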