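I have thousands of lines, each containing 1 to 100 numbers. Every line defines a group of numbers and a relationship among them. I need to get the sets of related numbers.
Small example: if I have these 7 lines of data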
T1 T2
T3
T4
T5
T6 T1
T5 T4
T3 T4 T7
I need a not-too-slow algorithm to find out that the sets here are:
T1 T2 T6 (because T1 is related to T2 in line 1 and to T6 in line 5)
T3 T4 T5 T7 (because T5 appears with T4 in line 6, and T3 appears with T4 and T7 in line 7)
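But when the sets get very big, searching for a T(x) in every large set is painfully slow, as is taking unions of sets, and so on.
Do you have a hint for doing this in a less brute-force manner?
I'm trying to do this in Python.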
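Answer 0 (score: 14)
Treat your numbers T1, T2, etc. as graph vertices. Any two numbers that appear together on a line are joined by an edge. Your problem then amounts to finding all the connected components in this graph. You can do this by starting with T1, then performing a breadth-first or depth-first search to find all vertices reachable from that point. Mark all of these vertices as belonging to equivalence class T1. Then find the next unmarked vertex Ti, find all the not-yet-marked vertices reachable from there, and label them as belonging to equivalence class Ti. Continue until all the vertices are marked.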
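For a graph with n vertices and e edges, this algorithm requires O(e) time and space to build the adjacency lists, and O(n + e) time and O(n) additional space to identify all the connected components once the graph structure has been built.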
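The labelling procedure described above could be sketched roughly as follows in Python 3. This is only an illustration under a few assumptions: the helper names `build_adjacency` and `connected_components` are made up for this example, the input is a list of whitespace-separated lines, and each line is linked through its first item rather than through every pair, which is enough to keep a whole line in one component.

```python
from collections import defaultdict


def build_adjacency(lines):
    """Build an adjacency mapping: items sharing a line become neighbours."""
    adjacency = defaultdict(set)
    for line in lines:
        items = line.split()
        if not items:
            continue  # skip blank lines
        first = items[0]
        adjacency[first]  # register the vertex even if it is alone on its line
        for other in items[1:]:
            # Linking every item to the first one is enough to keep the
            # whole line inside a single component.
            adjacency[first].add(other)
            adjacency[other].add(first)
    return adjacency


def connected_components(adjacency):
    """Label the vertices with an iterative depth-first search."""
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue  # already labelled as part of an earlier component
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            vertex = stack.pop()
            component.append(vertex)
            for neighbour in adjacency[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# The seven sample lines from the question.
lines = ["T1 T2", "T3", "T4", "T5", "T6 T1", "T5 T4", "T3 T4 T7"]
for component in connected_components(build_adjacency(lines)):
    print(" ".join(component))
# prints:
# T1 T2 T6
# T3 T4 T5 T7
```

Every vertex is pushed onto the stack at most once and every edge is examined at most twice, so the traversal stays linear in the size of the graph.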
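Answer 1 (score: 14)
Once you have built the data structure, exactly what queries do you want to run against it? Show us your existing code. What is a T(x)? You talk about "groups of numbers", but your sample data shows T1, T2, etc.; please explain.
Have you read this: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Try looking at this Python implementation: http://code.activestate.com/recipes/215912-union-find-data-structure/
Or you can bash out something rather simple and understandable yourself, e.g. [Update: totally new code]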
class DisjointSet(object):

    def __init__(self):
        self.leader = {} # maps a member to the group's leader
        self.group = {} # maps a group leader to the group (which is a set)

    def add(self, a, b):
        leadera = self.leader.get(a)
        leaderb = self.leader.get(b)
        if leadera is not None:
            if leaderb is not None:
                if leadera == leaderb: return # nothing to do
                groupa = self.group[leadera]
                groupb = self.group[leaderb]
                if len(groupa) < len(groupb):
                    a, leadera, groupa, b, leaderb, groupb = b, leaderb, groupb, a, leadera, groupa
                groupa |= groupb
                del self.group[leaderb]
                for k in groupb:
                    self.leader[k] = leadera
            else:
                self.group[leadera].add(b)
                self.leader[b] = leadera
        else:
            if leaderb is not None:
                self.group[leaderb].add(a)
                self.leader[a] = leaderb
            else:
                self.leader[a] = self.leader[b] = a
                self.group[a] = set([a, b])

data = """T1 T2
T3 T4
T5 T1
T3 T6
T7 T8
T3 T7
T9 TA
T1 T9"""
# data is chosen to demonstrate each of 5 paths in the code
from pprint import pprint as pp
ds = DisjointSet()
for line in data.splitlines():
    x, y = line.split()
    ds.add(x, y)
    print
    print x, y
    pp(ds.leader)
    pp(ds.group)
And here is the output from the last step:
T1 T9
{'T1': 'T1',
 'T2': 'T1',
 'T3': 'T3',
 'T4': 'T3',
 'T5': 'T1',
 'T6': 'T3',
 'T7': 'T3',
 'T8': 'T3',
 'T9': 'T1',
 'TA': 'T1'}
{'T1': set(['T1', 'T2', 'T5', 'T9', 'TA']),
 'T3': set(['T3', 'T4', 'T6', 'T7', 'T8'])}
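The `add` method above takes exactly two items, whereas the lines in the question hold anywhere from 1 to 100 items. One possible way to bridge that gap is to turn each line into pairs first. The helper below is a hypothetical Python 3 sketch (the name `pairs_from_line` is not part of the answer's code); it pairs the first item with every other item, and pairs a lone item with itself.

```python
def pairs_from_line(line):
    """Yield (first, other) pairs that tie every item on a line together.

    A line with a single item yields (item, item), so the item can still
    be registered as a group of its own.
    """
    items = line.split()
    if not items:
        return  # blank line: nothing to yield
    first = items[0]
    if len(items) == 1:
        yield first, first
    for other in items[1:]:
        yield first, other


print(list(pairs_from_line("T3 T4 T7")))  # [('T3', 'T4'), ('T3', 'T7')]
print(list(pairs_from_line("T5")))        # [('T5', 'T5')]
```

Each pair could then be passed to `ds.add(x, y)`. Reading the class as written, `add(a, a)` appears to create a singleton group for an unseen item and to do nothing for an item that is already known, so lone items should not disturb existing groups.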
Answer 2 (score: 2)
You can use a union-find data structure to achieve this goal.
The pseudocode for such an algorithm is as follows:
func find( var element )
    while ( element is not the root ) element = element's parent
    return element
end func

func union( var setA, var setB )
    var rootA = find( setA ), rootB = find( setB )
    if ( rootA is equal to rootB ) return
    else
        set rootB as rootA's parent
end func
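For reference, the pseudocode above might translate into Python 3 along these lines. This is a sketch rather than a definitive implementation: the class name `UnionFind` and its `groups` method are invented for the example, and it adds two common refinements that the pseudocode does not include, namely path compression in `find` and attaching the smaller tree under the larger one in `union` (the pseudocode always makes rootB the parent of rootA).

```python
class UnionFind:
    """Union-find over hashable items, with path compression and union by size."""

    def __init__(self):
        self.parent = {}  # maps each item to its parent; a root is its own parent
        self.size = {}    # maps each root to the number of items in its tree

    def find(self, element):
        # Register unseen items as singleton sets.
        if element not in self.parent:
            self.parent[element] = element
            self.size[element] = 1
            return element
        # Walk up to the root.
        root = element
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[element] != root:
            next_element = self.parent[element]
            self.parent[element] = root
            element = next_element
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return  # already in the same set
        # Attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def groups(self):
        """Return the current sets as a list of Python sets."""
        result = {}
        for element in list(self.parent):
            result.setdefault(self.find(element), set()).add(element)
        return list(result.values())


uf = UnionFind()
for line in ["T1 T2", "T3", "T4", "T5", "T6 T1", "T5 T4", "T3 T4 T7"]:
    items = line.split()
    for item in items:
        # Joining each item with the first one also registers lone items.
        uf.union(items[0], item)

print(sorted(sorted(group) for group in uf.groups()))
# [['T1', 'T2', 'T6'], ['T3', 'T4', 'T5', 'T7']]
```

With both refinements, a sequence of operations runs in nearly constant amortized time per operation, which is why this structure tends to suit inputs where groups are merged incrementally as lines are read.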
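Answer 3 (score: 1)
As Jim pointed out above, you are essentially looking for the connected components of a simple undirected graph, where the nodes are your entities (T1, T2 and so on) and the edges represent the pairwise relations between them. A simple implementation of connected component search is based on breadth-first search: you start a BFS from the first entity, find all the related entities, then start another BFS from the first entity that has not been discovered yet, and so on, until you have found them all. A simple implementation of BFS looks like this: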
from collections import deque

class BreadthFirstSearch(object):
    """Breadth-first search implementation using an adjacency list"""

    def __init__(self, adj_list):
        self.adj_list = adj_list

    def run(self, start_vertex):
        """Runs a breadth-first search from the given start vertex and
        yields the visited vertices one by one."""
        queue = deque([start_vertex])
        visited = set([start_vertex])
        adj_list = self.adj_list

        while queue:
            vertex = queue.popleft()
            yield vertex
            # neighbours that have not been queued yet
            unseen_neis = adj_list[vertex] - visited
            visited.update(unseen_neis)
            queue.extend(unseen_neis)

def connected_components(graph):
    seen_vertices = set()
    bfs = BreadthFirstSearch(graph)
    for start_vertex in graph:
        if start_vertex in seen_vertices:
            continue
        component = list(bfs.run(start_vertex))
        yield component
        seen_vertices.update(component)

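Here, adj_list or graph is an adjacency list data structure; basically, it gives you the neighbours of a given vertex in the graph. To build it from your file, you can do this: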
from collections import defaultdict

adj_list = defaultdict(set)
with open("your_file.txt") as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        v1 = parts.pop(0)
        adj_list[v1].update(parts)
        for v2 in parts:
            adj_list[v2].add(v1)
And then you can run:
components = list(connected_components(adj_list))
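Of course, implementing the whole algorithm in pure Python tends to be slower than an implementation in C with a more efficient graph data structure. You might consider using igraph or some other graph library such as NetworkX to do the job instead. Both libraries contain implementations of connected component search; in igraph, it boils down to this (assuming that your file does not contain lines with single entries, since only pairwise entries are accepted):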
>>> from igraph import load
>>> graph = load("edge_list.txt", format="ncol", directed=False)
>>> components = graph.clusters()
>>> print graph.vs[components[0]]["name"]
['T1', 'T2', 'T6']
>>> print graph.vs[components[1]]["name"]
['T3', 'T4', 'T5']
Disclaimer: I am one of the authors of igraph.
Answer 4 (score: 0)
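You can model a group using a set. In the example below, I've put the set into a Group class to make it easier to keep references to the groups and to track some notional "head" item.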
class Group:
    def __init__(self,head):
        self.members = set()
        self.head = head
        self.add(head)
    def add(self,member):
        self.members.add(member)
    def union(self,other):
        self.members = other.members.union(self.members)

groups = {}

for line in open("sets.dat"):
    line = line.split()
    if len(line) == 0:
        continue  # skip blank lines instead of stopping at the first one
    # find the group of the first item on the row
    head = line[0]
    if head not in groups:
        group = Group(head)
        groups[head] = group
    else:
        group = groups[head]
    # for each other item on the row, merge the groups
    for node in line[1:]:
        if node not in groups:
            # its a new node, straight into the group
            group.add(node)
            groups[node] = group
        elif head not in groups[node].members:
            # merge two groups
            new_members = groups[node]
            group.union(new_members)
            for migrate in new_members.members:
                groups[migrate] = group

# list them
for k,v in groups.iteritems():
    if k == v.head:
        print v.members
The output is:
set(['T6', 'T2', 'T1'])
set(['T4', 'T5', 'T3'])