I have a list of lists, as shown in the code I attached. I want to link up the sublists whenever they share any common value, and then replace the list of lists with a condensed list of lists.

Examples: if I have [[1,2,3],[3,4]] I want [1,2,3,4]. If I have [[4,3],[1,2,3]] I want [4,3,1,2]. If I have [[1,2,3],[a,b],[3,4],[b,c]] I want [[1,2,3,4],[a,b,c]] or [[a,b,c],[1,2,3,4]] — I don't care which one.

I am almost there...

My problem is a case like [[1,2,3],[10,5],[3,8,5]]: I want [1,2,3,10,5,8], but my current code gives me [1,2,3,8,10,5].

Here is my code:
import itertools
a = [1,2,3]
b = [3,4]
i = [21,22]
c = [88,7,8]
e = [5,4]
d = [3, 50]
f = [8,9]
g= [9,10]
h = [20,21]
lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
# I have a lot of lists, but not very many different lists

def any_overlap(a, b):
    sb = set(b)
    return any(itertools.imap(sb.__contains__, a))

def find_uniq(lst):
    ''' return the uniq parts of lst'''
    seen = set()
    seen_add = seen.add
    return [x for x in lst if x not in seen and not seen_add(x)]

def overlap_inlist(o_lst, lstoflst):
    '''
    Search for overlap, using "any_overlap", of a list (o_lst) in a list of lists (lstoflst).
    If there is overlap add the uniq part of the found list to the search list, and keep
    track of where that list was found.
    '''
    used_lst = []
    n_lst = []
    for lst_num, each_lst in enumerate(lstoflst):
        if any_overlap(o_lst, each_lst):
            n_lst.extend(each_lst)
            used_lst.append(lst_num)
    n_lst = find_uniq(n_lst)
    return n_lst, used_lst

def comb_list(lst):
    '''
    For each list in a list of lists find all the overlaps using 'overlap_inlist'.
    Update the list each time to delete the found lists. Return the final combined list.
    '''
    for updated_lst in lst:
        n_lst, used_lst = overlap_inlist(updated_lst, lst)
        lst[:] = [x for i, x in enumerate(lst) if i not in used_lst]
        lst.insert(0, n_lst)
    return lst
comb_lst = comb_list(lst)
print comb_lst
The output of this script is:

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 50, 5], [21, 22, 20]]

I want the result keyed to the original order, like:

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 5, 50], [21, 22, 20]]

where the 5 and 50 are switched in the new lst[2].

I am somewhat new to Python. I would appreciate any solutions to the problem or comments on my current code. I am not a computer scientist, and I imagine there may be some kind of algorithm that does this quickly (maybe from set theory?). If there is such an algorithm, please point me in the right direction. I am probably making this much more complicated than it needs to be... Thanks!!
Answer 0 (score: 13)

Here's a brute-force approach (it might be easier to understand):
from itertools import chain

def condense(*lists):
    # remember original positions
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        if item not in positions:
            positions[item] = pos
    # condense disregarding order
    sets = condense_sets(map(set, lists))
    # restore order
    result = [sorted(s, key=positions.get) for s in sets]
    return result if len(result) != 1 else result[0]

def condense_sets(sets):
    result = []
    for candidate in sets:
        for current in result:
            if candidate & current:   # found overlap
                current |= candidate  # combine (merge sets)
                # new items from candidate may create an overlap
                # between current set and the remaining result sets
                result = condense_sets(result)  # merge such sets
                break
        else:  # no common elements found (or result is empty)
            result.append(candidate)
    return result
>>> condense([1,2,3], [10,5], [3,8,5])
[1, 2, 3, 10, 5, 8]
>>> a = [1,2,3]
>>> b = [3,4]
>>> i = [21,22]
>>> c = [88,7,8]
>>> e = [5,4]
>>> d = [3, 50]
>>> f = [8,9]
>>> g= [9,10]
>>> h = [20,21]
>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
>>> condense([1], [2, 3, 2])
[[1], [2, 3]]
Performance comparison of the condense_*() functions from the answers to this question: lst_OP — the input list from the question (varying sizes); lst_BK — the test list from @Blckknght's answer (varying sizes). See the source.

The measurements show that the solutions based on the "disjoint sets" and "connected components of an undirected graph" concepts perform similarly on both types of input.
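The order-restoring step in condense() is worth seeing in isolation: each value is tagged with the index of its first appearance across all the input lists, and every merged set is then sorted by that tag. Here is a minimal sketch of just that step (first_positions is my own helper name, not part of the answer):

```python
from itertools import chain

def first_positions(lists):
    # map each value to the index of its first appearance across all lists
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        positions.setdefault(item, pos)
    return positions

lists = [[1, 2, 3], [10, 5], [3, 8, 5]]
positions = first_positions(lists)
merged = set().union(*lists)           # the single overlapping group here
ordered = sorted(merged, key=positions.get)
print(ordered)                         # [1, 2, 3, 10, 5, 8]
```

Because sets discard order, the sort by first-appearance index is what turns [1, 2, 3, 8, 10, 5] back into the desired [1, 2, 3, 10, 5, 8].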
Answer 1 (score: 6)

Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then joins them together, eliminating duplicates.
from collections import OrderedDict

def disjoint_set_find(djs, node):  # disjoint set find, with path compression
    if node not in djs:  # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node])  # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)    # find root of first set
    second = disjoint_set_find(djs, second)  # and of second set
    if first < second:  # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict()  # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist in enumerate(master_list):
        for v in sublist:
            if v not in values:  # no overlap, so just store value
                values[v] = i
            else:  # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = []      # results
    output_map = {}  # map from original indexes to output indexes
    for v, i in values.items():  # iterate over values in order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            output_map[root] = len(output)
            output.append([])  # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output
Sample output:
>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
Algorithm explanation:

As requested, here's an explanation of how my code works.

The first two functions implement the find and union operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root of a set. The find operation returns the root node of the set containing a given node. To help performance a bit, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union operation combines the sets containing its arguments first and second.

The main condense function has two parts. First, it sets up a couple of data structures, then it uses them to build the output.

values is an OrderedDict that maps each value to the index of the first sublist it is contained in. The order each value was added in is used to produce the output in the right order.

overlaps is the dictionary used for the disjoint set. Its nodes are the integer indexes of overlapping sublists.

The first loops fill up those two data structures. They loop over the sublists and their contents. If a value has not been seen before, it is added to the values dictionary. If it has been seen, the current sublist overlaps an earlier sublist containing that value. To resolve the overlap, the code does a union of the disjoint sets containing the two sublists.

The output is built in the output list. Because there are likely to be fewer output sublists than there were in the input, we need an extra dictionary to map the old indexes (from the input) to the new indexes that apply to the output list.

To fill up the output list, we iterate over the values, which happens in the order they were added thanks to the OrderedDict class. Using the disjoint set, it combines the overlapping lists correctly.

This algorithm has very good performance when there are many lists to be processed that don't overlap immediately. For instance, this set of 200 three-element lists all overlap in the end, but you only start to see the overlaps appear after the first 100 have been processed:
lst2 = [list(range(4*i, 4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
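To see the disjoint-set find/union machinery from this answer in isolation, here is a standalone sketch (find, union and first_seen are my own names, not the answer's) that groups the sublist indexes of the problem case [[1,2,3],[10,5],[3,8,5]] into a single set:

```python
def find(parent, node):
    if node not in parent:          # nodes without an entry are roots
        return node
    parent[node] = find(parent, parent[node])  # path compression
    return parent[node]

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # smaller index becomes the root

parent = {}
first_seen = {}  # value -> index of the first sublist containing it
for i, sub in enumerate([[1, 2, 3], [10, 5], [3, 8, 5]]):
    for v in sub:
        if v in first_seen:
            union(parent, first_seen[v], i)  # overlap: join the two sublists
        else:
            first_seen[v] = i

roots = [find(parent, i) for i in range(3)]
print(roots)   # [0, 0, 0] - all three sublists fall into one set
```

Sublist 2 shares the 3 with sublist 0 and the 5 with sublist 1, so all three indexes end up under root 0, which is exactly what condenseBK then uses to build one merged output list.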
Answer 2 (score: 4)

I'm sure there's a cleaner way to do this, but I started down a certain path and did what I had to do to make it work without any refactoring.
lookup = {}
out = []
index = 0
for grp in lst:
    keys = [lookup.get(num, None) for num in grp]
    keys = [key for key in keys if key is not None]
    if len(keys):
        if len(set(keys)) != 1:
            for num in grp:
                out[keys[0]].append(num)
            seen = set()
            keys = [key for key in keys if key not in seen and not seen.add(key)]
            for key in keys[1:]:
                out[keys[0]].extend(out[key])
                del out[key]
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            for num in grp:
                lookup[num] = keys[0]
            out[keys[0]].extend(grp)
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
    else:
        out.append(grp)
        for num in grp:
            lookup[num] = index
        index += 1
print out
Thanks to @Steven for the list reduction technique with sets.

Output:
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
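The set-based reduction idiom borrowed from @Steven deserves a standalone look, since it appears several times above. It deduplicates while preserving order (shown here with my own variable names):

```python
items = [1, 2, 3, 2, 10, 5, 3, 8, 5]
seen = set()
# seen.add() returns None, so "not seen.add(item)" is always True;
# it is only evaluated (recording the item) when the item is new.
deduped = [item for item in items if item not in seen and not seen.add(item)]
print(deduped)   # [1, 2, 3, 10, 5, 8]
```

The side effect inside the list comprehension is what lets a single pass both test membership and record each new item.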
Answer 3 (score: 4)

Your problem is essentially a graph-theoretic one, the problem of connected components, with an added requirement regarding the order of the elements of each component.

In your program, the set of all lists forms an undirected graph, where each list is a node in the graph. We say two lists are connected directly if they have common elements, and connected indirectly if there exists a third list to which both are connected, either directly or indirectly. Given e.g. the three lists [a,b], [b,c] and [c,d], then [a,b] and [b,c] are connected directly, as are [b,c] and [c,d], but [a,b] and [c,d] are connected indirectly, since while they don't share common elements, they both share elements with the same list [b,c].

A group of nodes is a connected component if all nodes in the group are connected (directly or indirectly) and no other node of the graph is connected to any node in the group.

There is a fairly simple linear-time algorithm that generates all connected components of an undirected graph. Using it, we can define a function that generates all lists of condensed disjoint lists while keeping the order of their elements:
from itertools import imap, combinations_with_replacement
from collections import defaultdict

def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield component(node)

def condense(lists):
    tuples = combinations_with_replacement(enumerate(imap(tuple, lists)), 2)
    overlapping = ((a, b) for a, b in tuples
                   if not set(a[1]).isdisjoint(b[1]))
    seen = set()
    see = seen.add
    for component in connected_components(overlapping):
        yield [item for each in sorted(component)
                    for item in each[1]
                    if not (item in seen or see(item))]

print list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]]))
print list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))
Result:
[[1, 2, 3, 10, 5, 8], [9]]
[[5, 6, 7], [1, 2, 3, 4]]
The time complexity of condense() is quadratic, since every list must be tested against every other list to find out if they have common elements. Hence, the performance is awful. Can we improve it somehow? Yes.

Two lists are connected directly if they have common elements. We can turn this relation around: two elements are connected directly if they belong to the same list, and connected indirectly if there exists a third element that connects them (directly or indirectly). Given e.g. the two lists [a,b] and [b,c], then a and b are connected directly, as are b and c, and therefore a and c are connected indirectly. If we also adapt J.F. Sebastian's idea of storing the position of each element's first occurrence, we can implement it like this:
def condense(lists):
    neighbors = defaultdict(set)
    positions = {}
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node), key=positions.get)
It still uses the connected components algorithm, but this time we view elements as connected, not lists. The results are the same as before, but since the time complexity is now linear, it runs much faster.
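For a quick check of the element-graph idea, here is a compact restatement of my own (same logic, with the generator plumbing removed and trading the nested generator for an explicit loop; condense_elements is my name, not the answer's):

```python
from collections import defaultdict

def condense_elements(lists):
    # Element graph: two items are neighbors when they share a sublist.
    neighbors = defaultdict(set)
    positions = {}
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            positions.setdefault(item, len(positions))
    # Walk each connected component once, then restore first-seen order.
    seen = set()
    result = []
    for start in neighbors:        # insertion order (Python 3.7+ dicts)
        if start in seen:
            continue
        component, todo = [], {start}
        while todo:
            node = todo.pop()
            seen.add(node)
            component.append(node)
            todo |= neighbors[node] - seen
        result.append(sorted(component, key=positions.get))
    return result

print(condense_elements([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))
# [[1, 2, 3, 4], [5, 6, 7]]
```

The groups match the answer above; only the order of the groups themselves differs, since modern dicts iterate in insertion order.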
Answer 4 (score: 4)

I tried to write a fast and readable solution. It is never much slower than the other solutions, if I know of them, but can sometimes be much faster, because it is additionally optimized for longer sublists and for many sublists that are subsets of any already-existing group. (This is motivated by the text of the question: "I have a lot of lists but not very many different lists.") The code uses less memory only for the condensed data, which can be much smaller than the original data. It can work e.g. with a generator collecting data from a real-time process. The estimate of complexity is O(n log n). I think that no algorithm that uses sorting can be of linear complexity.
def condense(lists):
    groups = {}      # items divided into groups {id(the_set): the_set,...}
    members = {}     # mapping from item to group
    positions = {}   # mapping from item to sequential ordering
    iposition = 0    # counter for positions
    for sublist in lists:
        if not sublist or members.get(sublist[0], set()).issuperset(sublist):
            continue  # speed-up condition if all is from one group
        common = set([x for x in sublist if x in members])
        if common:    # any common group can be a destination for merge
            dst_group = members[common.pop()]
            common = common - dst_group  # are more groups common for sublist?
            while common:
                grp = members[common.pop()]
                if len(grp) > len(dst_group):  # merge shorter into longer grp
                    grp, dst_group = dst_group, grp
                dst_group.update(grp)
                for item in grp:
                    members[item] = dst_group
                del groups[id(grp)]
                common = common - dst_group
        else:  # or create a new group if nothing common
            dst_group = set()
            groups[id(dst_group)] = dst_group
        newitems = []
        for item in sublist:  # merge also new items
            if item not in positions:
                positions[item] = iposition
                iposition += 1
                newitems.append(item)
                members[item] = dst_group
        dst_group.update(newitems)
    return [sorted(x, key=positions.get) for x in groups.values()]
It is faster than pillmuncher2 for sublists longer than approximately 8 items, because it can work on more items together. It is also very fast for lists with many similar sublists, or with many sublists that are subsets of any already-existing group. It is 25% faster than pillmuncher2 for lst_OP, but 15% slower for lst_BK.

An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].

I intentionally wrote "common = common - dst_group" instead of using the set operator -= or "set.difference_update", because an in-place update is ineffective if the set on the right side is much bigger than the set on the left side.
Here is pillmuncher's solution, modified for easier readability. The modification is a little slower than the original, due to replacing a generator with list.append. It is probably the most readable fast solution.
# Modified pillmuncher's solution
from collections import defaultdict

def condense(lists):
    neighbors = defaultdict(set)  # mapping from items to sublists
    positions = {}                # mapping from items to sequential ordering
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    see = seen.add
    for node in neighbors:
        if node not in seen:
            unseen = set([node])      # this is a "todo" set
            next_unseen = unseen.pop  # method alias, not called now
            group = []                # collects the output
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                group.append(node)
            yield sorted(group, key=positions.get)
Answer 5 (score: 3)
class List(list): pass

rv = dict()

def condense_step():
    """find and merge one overlapping pair in rv"""
    for i, iv in rv.items():
        for j, jv in rv.items():
            if i != j and i.intersection(j):
                m = i.union(j)
                del rv[i]
                del rv[j]
                rv.setdefault(m, [])
                rv[m] += iv
                rv[m] += jv
                return True

def unique(l):
    """flatten list-of-lists excluding duplicates"""
    seen = set()
    for i in sum(l, []):
        if i not in seen:
            seen.add(i)
            yield i

def condense(inp):
    rv.clear()
    inp = map(List, inp)
    for i in range(len(inp)):
        inp[i].order = i
        rv.setdefault(frozenset(inp[i]), [])
        rv[frozenset(inp[i])].append(inp[i])
    while condense_step():
        pass
    for v in rv.values():
        v.sort(key=lambda x: x.order)
    return [list(unique(i)) for i in rv.values()]
Answer 6 (score: 2)

This solution uses only an ordered dictionary.
deepcopy() is necessary if you want the original copy to remain unchanged.
from collections import OrderedDict
from copy import deepcopy

def treat(passed_list):
    L = deepcopy(passed_list)
    dic = OrderedDict()
    for subl in L:
        for x in subl:
            if x not in dic:
                dic[x] = subl
    print 'dic at start'
    print '\n'.join('%-3s : %s' % (a, dic[a])
                    for a in dic) + '\n'
    for sublist in L:
        short = []
        short.extend(el for el in sublist
                     if el not in short)
        seen = []
        for k, val in dic.iteritems():
            if val is sublist:
                break
            if k in short:
                if val not in seen:
                    seen.append(val)
        sumseen = []
        for elseen in seen:
            for y in elseen:
                sumseen.append(y)
                dic[y] = sumseen
        if seen:
            for el in sublist:
                if el not in sumseen:
                    sumseen.append(el)
                    dic[el] = sumseen
        sublist[:] = short
    cumul = []
    cumul.extend(lu for lu in dic.itervalues()
                 if lu not in cumul)
    return cumul

plus = [[1,2,3,2,1],[10,5,5,5,10],
        [8,5,3,3,5],[45,50,12,45,40,12]]
lst = [[1,2,3], [10,5], [3,8,5]]

for one_list in (plus, lst):
    print 'one_list before == %r\n' % one_list
    print 'treat(one_list) == %r\n' % treat(one_list)
    print 'one_list after == %r\n' % one_list
    print '===================================='
Result:
one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
dic at start
1 : [1, 2, 3, 2, 1]
2 : [1, 2, 3, 2, 1]
3 : [1, 2, 3, 2, 1]
10 : [10, 5, 5, 5, 10]
5 : [10, 5, 5, 5, 10]
8 : [8, 5, 3, 3, 5]
45 : [45, 50, 12, 45, 40, 12]
50 : [45, 50, 12, 45, 40, 12]
12 : [45, 50, 12, 45, 40, 12]
40 : [45, 50, 12, 45, 40, 12]
treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]
one_list after == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
====================================
one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]
dic at start
1 : [1, 2, 3]
2 : [1, 2, 3]
3 : [1, 2, 3]
10 : [10, 5]
5 : [10, 5]
8 : [3, 8, 5]
treat(one_list) == [[1, 2, 3, 10, 5, 8]]
one_list after == [[1, 2, 3], [10, 5], [3, 8, 5]]
====================================
This solution has one inconvenience: it is 2 to 3 times slower than J.F. Sebastian's solution.