Question

Real-world problem:

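I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, and sometimes they are not. Likewise, "John J. Smith, director of XYZ" and "John Smith, director of ABC" might or might not be the same person. Examining additional information (e.g., comparing biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") often makes it possible to resolve whether two observations are the same person.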

Conceptual version of the problem:

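In that spirit, I am collecting data that identifies matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity of the relation "is the same person as" to generate the "connected components" {{a, b, c, d, e}, {f, g}}. That is, {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of the question referred to "cliques", which are apparently something else; this would explain why find_cliques in networkx was giving the "wrong" results for my purposes.)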

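The following Python code does the job. But I wonder: is there a better (less computationally costly) approach, e.g., using standard or available libraries?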

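There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure which libraries they refer to or how to set up my data to use them.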

Sample Python 2 code:

def get_cliques(pairs):
    from sets import Set

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))

This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].

Sample Python 3 code (this produces [{'a', 'c', 'b', 'e', 'd'}, {'g', 'f'}]; the order of elements within each set may vary):

def get_cliques(pairs):

    set_list = [set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(set(pair))

    return set_list

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]

print(get_cliques(pairs))
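
A caveat on the function above, offered as a minimal check rather than a full analysis: each pair is added to the first set that contains either of its members, so a pair that links two sets created earlier does not merge them. The function body below is the Python 3 version from the question; the input ordering is an assumption chosen to trigger the case.

```python
def get_cliques(pairs):
    # Python 3 version from the question, behavior unchanged
    set_list = [set(pairs[0])]
    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(set(pair))
    return set_list

# Hypothetical ordering: ('b', 'c') arrives after both groups it links.
bridging = [('a', 'b'), ('c', 'd'), ('b', 'c')]
result = get_cliques(bridging)
print(result)  # two overlapping sets, {a, b, c} and {c, d}, rather than one
```

With the pairs in the question's original order the problem does not show up, because every new pair touches the set built so far.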

Answer 1

Using networkX:

import networkx as nx
G1=nx.Graph()
G1.add_edges_from([("a","b"),("b","c"),("c","d"),("d","e"),("f","g")])
sorted(nx.connected_components(G1), key = len, reverse=True)

which gives (older networkx releases return lists; recent ones yield sets):

[['a', 'd', 'e', 'b', 'c'], ['f', 'g']]

You now have to check which algorithm is fastest...

OP：

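This is great! I have this in my PostgreSQL database now. Just organize the pairs into a two-column table, then use array_agg() to pass them to the PL/Python function get_connected(). Thanks.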

CREATE OR REPLACE FUNCTION get_connected(
    lhs text[],
    rhs text[])
  RETURNS SETOF text[] AS
$BODY$
    pairs = zip(lhs, rhs)

    import networkx as nx
    G=nx.Graph()
    G.add_edges_from(pairs)
    return sorted(nx.connected_components(G), key = len, reverse=True)

$BODY$ LANGUAGE plpythonu;

(Note: I edited the answer because I thought showing this step might be a helpful addendum, but it was too long for a comment.)

Answer 2

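I don't believe (correct me if I'm wrong) that this is directly related to the largest clique problem. The definition of cliques (Wikipedia) says that a clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge. In this case, we want to find which nodes can reach each other (even indirectly).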

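I made a small sample. It builds a graph and traverses it looking for neighbors. This should be quite efficient, since each node is traversed only once while the groups are formed.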

from collections import defaultdict

def get_cliques(pairs):
    # Build a graph using the pairs
    nodes = defaultdict(lambda: [])
    for a, b in pairs:
        if b is not None:
            nodes[a].append((b, nodes[b]))
            nodes[b].append((a, nodes[a]))
        else:
            nodes[a]  # empty list

    # Add all neighbors to the same group    
    visited = set()
    def _build_group(key, group):
        if key in visited:
            return
        visited.add(key)
        group.add(key)
        for key, _ in nodes[key]:
            _build_group(key, group)

    groups = []
    for key in nodes.keys():
        if key in visited: continue
        groups.append(set())
        _build_group(key, groups[-1])

    return groups

if __name__ == '__main__':
    pairs = [
        ('a', 'b'), ('b', 'c'), ('b', 'd'), # a "tree"
        ('f', None),                        # no relations
        ('h', 'i'), ('i', 'j'), ('j', 'h')  # circular
    ]
    print(get_cliques(pairs))
    # Output: [set(['a', 'c', 'b', 'd']), set(['f']), set(['i', 'h', 'j'])]
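
The recursive helper above can run into Python's recursion limit when one group forms a long chain. A minimal iterative sketch of the same traversal idea follows; the name connected_groups and the use of a deque are assumptions for illustration, not part of the answer's code.

```python
from collections import defaultdict, deque

def connected_groups(pairs):
    # Adjacency sets; a pair (a, None) registers a node with no neighbours.
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a]  # touching the key makes sure the node exists
        if b is not None:
            neighbours[a].add(b)
            neighbours[b].add(a)

    visited = set()
    groups = []
    for start in list(neighbours):
        if start in visited:
            continue
        # Breadth-first walk from an unvisited node collects one group.
        group = set()
        queue = deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups

pairs = [
    ('a', 'b'), ('b', 'c'), ('b', 'd'),  # a "tree"
    ('f', None),                         # no relations
    ('h', 'i'), ('i', 'j'), ('j', 'h'),  # circular
]
print(connected_groups(pairs))
```

Each node is enqueued at most once, so the work is proportional to the number of nodes plus the number of pairs.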

If your data set is best modeled as a graph and is really large, perhaps a graph database such as Neo4j would be appropriate?

Answer 3

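DSM's comment made me look for set consolidation algorithms in Python. Rosetta Code has two versions of the same algorithm. Example use (the non-recursive version):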

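# Test data from Answer 2 (it matches the output shown at the end of this listing)
pairs = [('a', 'b'), ('b', 'c'), ('b', 'd'), ('f', None), ('h', 'i'), ('i', 'j'), ('j', 'h')]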

# Copied from Rosetta Code
def consolidate(sets):
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

print(consolidate([set(pair) for pair in pairs]))
# Output: [set(['a', 'c', 'b', 'd']), set([None, 'f']), set(['i', 'h', 'j'])]

Answer 4

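I tried an alternative implementation that uses a dictionary for lookups, and may have gotten a small reduction in computation time.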

# Modified to use a dictionary
from collections import defaultdict

def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed.values()

Just to convince myself that it returns the right result (get_cliques1 here is your original Python 2 solution):

>>> from cliques import *
>>> get_cliques1(pairs) # Original Python 2 solution
[Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])]
>>> get_cliques2(pairs) # Dictionary-based alternative
[['a', 'c', 'b', 'e', 'd'], ['g', 'f']]

Timing information in seconds (10 million repetitions):

$ python get_times.py 
get_cliques: 75.1285209656
get_cliques2: 69.9816100597

For completeness and reference, here is the full listing of both cliques.py and the get_times.py timing script:

# cliques.py
# Python 2.7
from collections import defaultdict
from sets import Set  # I moved your import out of the function to try to get closer to apples-apples

# Original Python 2 solution
def get_cliques1(pairs):

    set_list = [Set(pairs[0])]

    for pair in pairs[1:]:
        matched=False
        for set in set_list:
            if pair[0] in set or pair[1] in set:
                set.update(pair)
                matched=True
                break
        if not matched:
            set_list.append(Set(pair))

    return set_list

# Modified to use a dictionary
def get_cliques2(pairs):
  maxClique = 1
  clique = defaultdict(int)
  for (a, b) in pairs:
    currentClique = max(clique[i] for i in (a,b))
    if currentClique == 0:
      currentClique = maxClique
      maxClique += 1
    clique[a] = clique[b] = currentClique
  reversed = defaultdict(list)
  for (k, v) in clique.iteritems(): reversed[v].append(k)
  return reversed.values()

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]


# get_times.py
# Python 2.7
from timeit import timeit

REPS = 10000000

print "get_cliques: " + str(timeit(
  stmt='get_cliques1(pairs)', setup='from cliques import get_cliques1, pairs',
  number=REPS
))
print "get_cliques2: " + str(timeit(
  stmt='get_cliques2(pairs)', setup='from cliques import get_cliques2, pairs',
  number=REPS
))

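So at least in this contrived scenario there is a measurable speedup (roughly 7%). Admittedly it is not groundbreaking, and I am sure I left some performance on the table in my implementation, but maybe it will help you think about other alternatives?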
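
One further alternative, sketched under assumptions rather than taken from any answer here: a disjoint-set (union-find) structure, which groups pairs correctly in any input order, including a pair that links two groups formed earlier. The names connected_components and find are hypothetical, and this version uses path compression without union by rank.

```python
from collections import defaultdict

def connected_components(pairs):
    parent = {}

    def find(x):
        # Locate the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

pairs = [('a', 'b'), ('c', 'd'), ('b', 'c'), ('f', 'g')]
print(connected_components(pairs))
```

It needs only the standard library, so it could be timed against the functions above with the same timeit script.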

How to aggregate matching pairs into "connected components" in Python
