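I have the following data:
A 'loop' is an ordered subset of rows such that each row shares one element with the row above it and its other element with the row below it. The goal is to find the arrays of indices into the data that yield closed loops.
Example data (single loop):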
In: data = np.array([[0, 7],
                     [1, 8],
                     [2, 9],
                     [3, 0],
                     [4, 1],
                     [5, 2],
                     [6, 3],
                     [4, 7],
                     [8, 5],
                     [9, 6]])
Example solution:
In: ordered_indices = np.array([0, 7, 4, 1, 8, 5, 2, 9, 6, 3])
In: data[ordered_indices]
Out: array([[0, 7],
            [4, 7],
            [4, 1],
            [1, 8],
            [8, 5],
            [5, 2],
            [2, 9],
            [9, 6],
            [6, 3],
            [3, 0]])
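The order of the elements within a row is not guaranteed; that is, 7 could be the first element in both rows in which it appears, or the first element in one and the second in the other.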
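Because the orientation of each row is arbitrary, it helps to have a check for a candidate ordering that does not care which column a shared element sits in. The following is only a minimal sketch; `is_closed_loop` is a hypothetical helper name, not part of the original post, and it assumes the rows are pairs of integers.

```python
import numpy as np

def is_closed_loop(rows):
    """Return True if the rows, taken in order, form one closed loop."""
    rows = np.asarray(rows)
    if len(rows) < 2:
        return False
    # Try both orientations of the first row: `start` must be reached again
    # at the end, `target` is the element shared with the next row.
    for start, target in (rows[0], rows[0][::-1]):
        current = target
        ok = True
        for a, b in rows[1:]:
            if current == a:
                current = b
            elif current == b:
                current = a
            else:
                ok = False
                break
        if ok and current == start:
            return True
    return False

data = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                 [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
ordered_indices = np.array([0, 7, 4, 1, 8, 5, 2, 9, 6, 3])

print(is_closed_loop(data[ordered_indices]))  # True
print(is_closed_loop(data))                   # False
```

The check walks the rows once, so it is cheap enough to run on every result while experimenting with faster solutions.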
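The data has on the order of N = 1000 rows, and a solution based on Python loops is too slow.
For convenience, typical data can be generated with the script below. Here the indices of the ordered data follow a periodic pattern, but that is not the case for the real data.
Generate sample data: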
import numpy as np
# parameters
N = 1000
M = 600
# initialize array
data = np.empty((N, 2), dtype=int)
# populate first column
data[:, 0] = np.arange(N)
# populate second column by shifting first column; create two loops within the data
inds1 = np.arange(0, M)[np.arange(-7, M - 7)]
inds2 = np.arange(M, N)[np.arange(-9, N - M - 9)]
data[:M, 1] = data[inds1, 0]
data[M:, 1] = data[inds2, 0]
# shuffle order of two entries in each row, in place
for row in data:
    np.random.shuffle(row)
I wrote a method that gets the correct result, but it is slow (about 0.5 s on my aging laptop):
Baseline solution:
def groupRows(rows):
    # create a list of indices
    ungrouped_rows = list(range(len(rows)))
    # initialize list of lists of indices
    grouped_rows = []
    # loop until there are no ungrouped rows
    while 0 < len(ungrouped_rows):
        # remove a row from the beginning of the list
        row_index = ungrouped_rows.pop(0)
        # start a new list of rows
        grouped_rows.append([row_index])
        # get the element at the start of the loop
        stop_at = rows[grouped_rows[-1][0], 0]
        # search target
        look_for = rows[grouped_rows[-1][0], 1]
        # continue until loop is closed
        closed = False
        while not closed:
            # for every row left in the ungrouped list
            for i, row_index in enumerate(ungrouped_rows):
                # get two elements in the row being checked
                b1, b2 = rows[row_index]
                # if row contains the current search target
                if look_for in (b1, b2):
                    # add it to the current group
                    grouped_rows[-1].append(ungrouped_rows.pop(i))
                    # update search target
                    if look_for == b1:
                        look_for = b2
                    elif look_for == b2:
                        look_for = b1
                    # exit the loop through the ungrouped rows
                    break
            # check if the loop is closed
            if look_for == stop_at:
                closed = True
    return [np.array(group) for group in grouped_rows]
So my approach works, but it relies on lists and two nested loops; there must be a slicker way to do this with numpy's more efficient methods. Any ideas?
Answer 0 (score: 0)
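If you want to find a longest "closed loop", I think your problem is NP-complete (finding a longest simple cycle in an undirected graph).
If you want to find an arbitrary cycle, try a depth-first search; it takes about 0.02 s for 1000 elements: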
from collections import defaultdict

def ordered(data, N):
    # adjacency list: each value maps to the values it shares a row with
    edges = defaultdict(list)
    for v1, v2 in data:
        edges[v1].append(v2)
        edges[v2].append(v1)
    visited = [False] * N
    path = None
    # start a search from every value that has not been visited yet
    for v in range(N):
        if not visited[v]:
            path = dfs(edges, visited, v)
            if path is not None: break
    # turn the path of values into consecutive pairs
    if path is not None:
        return [[path[i], path[i + 1]] for i in range(len(path) - 1)]

def dfs(edges, visited, v1, vp=None):
    path = [v1]
    # reaching an already visited value closes the cycle
    if visited[v1]: return path
    visited[v1] = True
    for v2 in edges[v1]:
        # do not walk straight back to the value we came from
        if v2 == vp: continue
        path_child = dfs(edges, visited, v2, v1)
        if path_child is not None: return path + path_child
    return None

data = [[0, 7], [1, 8], [2, 9], [3, 0], [4, 1], [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]]
N = 10
result = ordered(data, N)
print(result)
# [[0, 7], [7, 4], [4, 1], [1, 8], [8, 5], [5, 2], [2, 9], [9, 6], [6, 3], [3, 0]]
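The search above returns pairs of values rather than row indices, and it recurses once per step, so a long loop could run into Python's default recursion limit. If every value occurs in exactly two rows (an assumption that holds for the sample data in the question), a non-recursive walk that records row indices might look like the sketch below. The name `ordered_loops` is hypothetical and this is not the code from the answer.

```python
from collections import defaultdict
import numpy as np

def ordered_loops(rows):
    """Return one index array per closed loop, in loop order.

    Assumes every value occurs in exactly two rows.
    """
    pairs = np.asarray(rows).tolist()
    # map each value to the indices of the two rows that contain it
    where = defaultdict(list)
    for i, (a, b) in enumerate(pairs):
        where[a].append(i)
        where[b].append(i)
    seen = [False] * len(pairs)
    loops = []
    for start in range(len(pairs)):
        if seen[start]:
            continue
        loop = []
        # leave the starting row through its second element
        i, target = start, pairs[start][1]
        while not seen[i]:
            seen[i] = True
            loop.append(i)
            # the next row is the other row that contains `target`
            r0, r1 = where[target]
            j = r1 if r0 == i else r0
            # the next target is the other element of that row
            a, b = pairs[j]
            target = b if a == target else a
            i = j
        loops.append(np.array(loop))
    return loops

data = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                 [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
loops = ordered_loops(data)
print(loops[0])  # [0 7 4 1 8 5 2 9 6 3]
```

Each row is visited once and every lookup is a dictionary access, so the cost grows linearly with the number of rows instead of quadratically as in the baseline.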
Answer 1 (score: 0)
Here is a way that cheats by using the scipy.sparse.csgraph module:
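import numpy as np
import scipy.sparse
import scipy.sparse.csgraph

# shuffle the rows so that their order carries no information
data = np.random.permutation(data)

def find_closed_loops(data):
    # incidence matrix: one row per value, one column per data row
    incidence = scipy.sparse.csr_matrix((np.ones(data.size), (data.flatten(), np.arange(data.size) // 2)))
    # two data rows are adjacent if they share a value
    adjacency = incidence.T @ incidence
    n, labels = scipy.sparse.csgraph.connected_components(adjacency, return_labels=True)
    # yield the indices of the data rows in each connected component
    for l in range(n):
        yield np.flatnonzero(labels == l)

for idx in find_closed_loops(data):
    print(idx)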
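Note that `np.flatnonzero` returns the rows of each component in ascending index order, not in the order in which they follow each other around the loop. If the loop order is needed, one possible approach is to let numpy pair up the two occurrences of each value with a single `argsort` and then follow the resulting lookup table. The sketch below is only an illustration: `order_loops_numpy` is a hypothetical name, it assumes every value occurs in exactly two rows, and the final walk is still a Python loop, although it only does one table lookup per row.

```python
import numpy as np

def order_loops_numpy(rows):
    """Return one index array per closed loop, in loop order.

    Assumes every value occurs in exactly two rows.
    """
    rows = np.asarray(rows)
    flat = rows.ravel()  # slot s holds rows[s // 2, s % 2]
    order = np.argsort(flat, kind="stable")
    # after sorting, the two slots holding the same value are neighbours
    first, second = order[0::2], order[1::2]
    if not np.array_equal(flat[first], flat[second]):
        raise ValueError("every value must occur in exactly two rows")
    partner = np.empty(flat.size, dtype=np.intp)
    partner[first] = second
    partner[second] = first
    # entering a row at slot s, leave it through its other slot (s ^ 1)
    # and jump to the slot of the next row that holds the same value
    step = partner[np.arange(flat.size) ^ 1].tolist()
    seen = [False] * len(rows)
    loops = []
    for start in range(len(rows)):
        if seen[start]:
            continue
        loop = []
        s = 2 * start
        while not seen[s // 2]:
            seen[s // 2] = True
            loop.append(s // 2)
            s = step[s]
        loops.append(np.array(loop))
    return loops

sample = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                   [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
print(order_loops_numpy(sample)[0])  # [0 7 4 1 8 5 2 9 6 3]
```

The sort dominates the cost, so the whole thing is O(N log N); the explicit check guards against input that violates the two-occurrences assumption in the simplest way.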