Merging lists that have overlapping elements

Time: 2019-06-12 16:57:28

Tags: python

I have a collection of lists, some of which have overlapping elements:

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

I want to pool together only the lists that overlap, which would produce

pooled = [['aaaa', 'aaab', 'abaa','bbbb','cccc'], 
          ['eeee','eeef','gggg','gggi','hhhh','iiii'],
          ['dddd', 'dddd']]

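(In case it's unclear: the first and second lists both overlap with the third, so even though they share no elements with each other, they should all be merged together.)

By "overlap" I mean that two lists have at least one element in common. By "merge" I mean combining two lists into a single flat list or a single flat set.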
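The two definitions above can be written down directly. This is only a minimal sketch; the helper names `overlaps` and `merge` are hypothetical and not part of the question:

```python
def overlaps(a, b):
    """Return True if lists a and b share at least one element."""
    return not set(a).isdisjoint(b)


def merge(a, b):
    """Combine two lists into a single flat set (duplicates collapse)."""
    return set(a) | set(b)


print(overlaps(['aaaa', 'aaab', 'abaa'], ['aaaa', 'bbbb']))  # True
print(overlaps(['aaaa', 'aaab', 'abaa'], ['bbbb', 'bbbb']))  # False
print(sorted(merge(['aaaa', 'aaab'], ['aaaa', 'bbbb'])))
```

The second call shows why pairwise checks alone are not enough: those two lists do not overlap directly, yet they belong together because each overlaps with `['aaaa', 'bbbb']`.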

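There may be several such groups, e.g. x, y and z overlap with each other, and v and w overlap with each other, but x + y + z does not overlap with v + w. Some lists may not overlap with anything.

(An analogy is families: unite all the Montagues and unite all the Capulets, but no Montague ever married a Capulet, so the two groups stay distinct.)

I don't care whether duplicates are included multiple times.

What is a simple and reasonably fast way to do this in Python?

Edit: this does not appear to be a duplicate of "Yet another merging list of lists, but most pythonic way", because that question does not seem to consider groups that overlap only through a third group. The solutions I tried from that question did not give the answer I want here.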

3 answers:

Answer 0: (score: 1)

Here is one way to do it (assuming you want unique elements in the overlapping result):

def over(coll):
    print('Input is:\n', coll)
    # gather the lists that overlap with at least one other list
    overlapping = [x for x in coll if any(x_element in [y for k in coll if k != x for y in k] for x_element in x)]
    # flatten and keep unique elements
    # (every overlapping list lands in this one pool, so separate overlapping groups are not kept apart)
    overlapping = sorted(set(z for x in overlapping for z in x))
    # get the rest
    non_overlapping = [x for x in coll if all(y not in overlapping for y in x)]
    # use the line below only if merged non-overlapping elements are desired
    # non_overlapping = sorted([y for x in non_overlapping for y in x])
    print('Output is:\n', [overlapping, non_overlapping])

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'],
        ['aaaa', 'bbbb'],
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa']]
over(coll)
coll = [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
over(coll)

Output:

$ python3 over.py
Input is:
 [['aaaa', 'aaab', 'abaa'], ['bbbb', 'bbbb'], ['aaaa', 'bbbb'], ['dddd', 'dddd'], ['bbbb', 'bbbb', 'cccc', 'aaaa']]
Output is:
 [['aaaa', 'aaab', 'abaa', 'bbbb', 'cccc'], [['dddd', 'dddd']]]
Input is:
 [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]
Output is:
 [[], [['aaaa', 'aaaa'], ['bbbb', 'bbbb']]]


Answer 1: (score: 1)

You can do this with sets, using a successive-merge approach:

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

pooled = [set(subList) for subList in coll]
merging = True
while merging:
    merging=False
    for i,group in enumerate(pooled):
        merged = next((g for g in pooled[i+1:] if g.intersection(group)),None)
        if not merged: continue
        group.update(merged)
        pooled.remove(merged)
        merging = True

print(pooled)
# [{'aaaa', 'abaa', 'aaab', 'cccc', 'bbbb'}, {'dddd'}, {'gggg', 'eeef', 'eeee', 'hhhh', 'gggi', 'iiii'}]
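The same successive-merge idea can be wrapped in a reusable function. The sketch below is one possible variant, not the answer's own code: the name `merge_overlapping` is hypothetical, and it restarts the scan after each merge so the list is never modified while it is being iterated.

```python
def merge_overlapping(lists):
    """Repeatedly merge any two sets that share an element until none do."""
    pooled = [set(sub) for sub in lists]
    merging = True
    while merging:
        merging = False
        for i, group in enumerate(pooled):
            for j in range(i + 1, len(pooled)):
                if group & pooled[j]:
                    # absorb the overlapping set and start a fresh scan
                    group |= pooled.pop(j)
                    merging = True
                    break
            if merging:
                break
    return pooled


coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'],
        ['aaaa', 'bbbb'],
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc', 'aaaa'],
        ['eeee', 'eeef', 'gggg', 'gggi'],
        ['gggg', 'hhhh', 'iiii']]

for group in merge_overlapping(coll):
    print(sorted(group))
```

Restarting the scan costs some extra comparisons, so this favors simplicity over speed; for a handful of short lists the difference should not matter.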

Answer 2: (score: 0)

Following alkasm's suggestion in the comments, I used networkx:

import networkx as nx

coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'], 
        ['aaaa', 'bbbb'], 
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc','aaaa'],
        ['eeee','eeef','gggg','gggi'],
        ['gggg','hhhh','iiii']]

edges = []
for i in range(len(coll)):
    a = coll[i]
    for j in range(len(coll)):
        if i != j:
            b = coll[j]
            if set(a).intersection(set(b)):
                edges.append((i,j))

G = nx.Graph()
G.add_nodes_from(range(len(coll)))
G.add_edges_from(edges)

for c in nx.connected_components(G):
    combined_lists = [coll[i] for i in c]
    flat_list = [item for sublist in combined_lists for item in sublist]
    print(set(flat_list))

Output:

{'cccc', 'bbbb', 'aaab', 'aaaa', 'abaa'}
{'dddd'}
{'eeef', 'eeee', 'hhhh', 'gggg', 'gggi', 'iiii'}

No doubt this can be optimized, but for now it seems to solve my problem.
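One possible direction for such an optimization, offered only as a sketch: instead of comparing every pair of lists, link the elements inside each list with a union-find (disjoint-set) structure and read off the groups at the end. This swaps networkx for plain Python; the name `pool_overlapping` and its inner helpers are hypothetical.

```python
def pool_overlapping(lists):
    """Group elements that are connected through shared list membership."""
    parent = {}

    def find(x):
        # register unseen items as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for sub in lists:
        if not sub:
            continue
        first = sub[0]
        find(first)  # make sure single-element lists are registered too
        for item in sub[1:]:
            union(first, item)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


coll = [['aaaa', 'aaab', 'abaa'],
        ['bbbb', 'bbbb'],
        ['aaaa', 'bbbb'],
        ['dddd', 'dddd'],
        ['bbbb', 'bbbb', 'cccc', 'aaaa'],
        ['eeee', 'eeef', 'gggg', 'gggi'],
        ['gggg', 'hhhh', 'iiii']]

for group in pool_overlapping(coll):
    print(sorted(group))
```

Each list is visited once, so the work grows with the total number of elements rather than with the number of list pairs. The result is a list of sets, so duplicates within a group collapse, as in the networkx version.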