Efficiently group rows of an Nx2 integer numpy array so that consecutive rows share an element

Asked: 2013-07-28 21:20:45

Tags: python numpy

OK, here is an odd one for the numpy speed freaks out there.

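I have data with the following properties:

  • An Nx2 array of integer values
  • Every integer from 0 to N-1 appears exactly twice
  • The data contains one or more 'loops'.

A 'loop' is an ordered subset of rows such that each row shares one element with the row above it and the other element with the row below it. The goal is to find the index arrays into the data that produce closed loops.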

Example data (single loop):

In: data = np.array([[0, 7],
                     [1, 8],
                     [2, 9],
                     [3, 0],
                     [4, 1],
                     [5, 2],
                     [6, 3],
                     [4, 7],
                     [8, 5],
                     [9, 6]])

Example solution:

In: ordered_indices = np.array([0, 7, 4, 1, 8, 5, 2, 9, 6, 3])

In: data[ordered_indices]
Out: array([[0, 7],
            [4, 7],
            [4, 1],
            [1, 8],
            [8, 5],
            [5, 2],
            [2, 9],
            [9, 6],
            [6, 3],
            [3, 0]])

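There is no guarantee about the order of elements within a row; that is, 7 could be the first element in both rows in which it appears, or the first element in one and the second in the other.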
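
A candidate ordering can be checked mechanically. The following is a minimal sketch, not part of the original question; the helper name is_closed_loop is hypothetical. It walks the rows in order, carrying the element that must be shared with the next row, and tries both orientations of the first row:

```python
import numpy as np

def is_closed_loop(rows):
    """Return True if consecutive rows are chained by shared elements
    and the last row links back to the first (hypothetical helper)."""
    rows = np.asarray(rows)
    # Either element of the first row may be the one shared with the second row.
    for first_link in rows[0]:
        link = first_link
        # The other element of the first row must close the loop at the end.
        start = rows[0].sum() - first_link
        ok = True
        for a, b in rows[1:]:
            if link == a:
                link = b
            elif link == b:
                link = a
            else:
                ok = False
                break
        if ok and link == start:
            return True
    return False

data = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                 [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
ordered_indices = np.array([0, 7, 4, 1, 8, 5, 2, 9, 6, 3])

print(is_closed_loop(data[ordered_indices]))
print(is_closed_loop(data))
```

For the example above, the ordered rows pass the check and the rows in their original order do not.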

The data is on the order of N = 1000; solutions with loops are too slow.

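For convenience, typical data can be generated with the script below. Here the indices of the ordered data follow a periodic pattern, but that is not the case in the real data.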

Generating sample data:

import numpy as np
import sys

# parameters
N = 1000
M = 600

# initialize array
data = np.empty((N,2), dtype=int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent

# populate first column
data[:,0] = np.arange(N)

# populate second column by shifting first column; create two loops within the data
inds1 = np.arange(0,M)[np.arange(-7,M-7)]
inds2 = np.arange(M,N)[np.arange(-9,N-M-9)]
data[:M,1] = data[inds1,0]
data[M:,1] = data[inds2,0]

# shuffle order of two entries in rows
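# (an explicit loop is used because map() is lazy in Python 3 and would do nothing here)
for row in data:
    np.random.shuffle(row)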
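
Before timing any solution on generated data, it may be worth confirming that the data actually satisfies the stated precondition (every integer from 0 to N-1 appears exactly twice). A minimal sketch, assuming an integer array; the function name satisfies_precondition is hypothetical:

```python
import numpy as np

def satisfies_precondition(data):
    """Return True if data is Nx2 and each of 0..N-1 appears exactly twice."""
    data = np.asarray(data)
    if data.ndim != 2 or data.shape[1] != 2 or len(data) == 0:
        return False
    n = len(data)
    # Values outside 0..N-1 violate the precondition (and would break bincount).
    if data.min() < 0 or data.max() >= n:
        return False
    # counts[v] is the number of times value v occurs anywhere in the array.
    counts = np.bincount(data.ravel(), minlength=n)
    return bool(np.all(counts == 2))

example = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                    [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
print(satisfies_precondition(example))
```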

I wrote a method that gets the correct result, but it is slow (about 0.5 seconds on my aging laptop):

Baseline solution:

def groupRows(rows):

    # create a list of indices
    ungrouped_rows = list(range(len(rows)))  # a list is needed so that pop() works in Python 3

    # initialize list of lists of indices
    grouped_rows = []

    # loop until there are no ungrouped rows
    while 0 < len(ungrouped_rows):

        # remove a row from the beginning of the list
        row_index = ungrouped_rows.pop(0)

        # start a new list of rows
        grouped_rows.append([row_index])

        # get the element at the start of the loop
        stop_at = rows[grouped_rows[-1][0],0]

        # search target
        look_for = rows[grouped_rows[-1][0],1]

        # continue until loop is closed
        closed = False

        while not closed:

            # for every row left in the ungrouped list
            for i, row_index in enumerate(ungrouped_rows):

                # get two elements in the row being checked
                b1,b2 = rows[row_index]

                # if row contains the current search target
                if look_for in (b1,b2):

                    # add it to the current group
                    grouped_rows[-1].append(ungrouped_rows.pop(i))

                    # update search target
                    if look_for == b1:
                        look_for = b2
                    elif look_for == b2:
                        look_for = b1

                    # exit the loop through the ungrouped rows
                    break

            # check if the loop is closed
            if look_for == stop_at:
                closed = True

    return [np.array(group) for group in grouped_rows]

So my approach works, but it uses lists and two nested loops; there must be a slicker, more efficient way to do this with numpy. Any ideas?

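2 Answers:

Answer 0: (score: 0)

If you want to find the longest "closed loop", I think your problem is NP-complete (finding the longest simple cycle in an undirected graph).

If you want to find an arbitrary cycle, try a depth-first search; it takes about 0.02 s for 1000 elements: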

from collections import defaultdict

def ordered(data, N):
    # adjacency list: value -> values it shares a row with
    edges = defaultdict(list)
    for v1, v2 in data:
        edges[v1].append(v2)
        edges[v2].append(v1)

    visited = [False] * N
    path = None
    for v in range(N):
        if not visited[v]:
            path = dfs(edges, visited, v)
            if path is not None: break
    if path is not None:
        return [[path[i], path[i + 1]] for i in range(len(path) - 1)]


def dfs(edges, visited, v1, vp=None):
    path = [v1]
    if visited[v1]: return path
    visited[v1] = True
    for v2 in edges[v1]:
        if v2 == vp: continue
        path_child = dfs(edges, visited, v2, v1)
        if path_child is not None: return path + path_child
    return None

data = [[0, 7], [1, 8], [2, 9], [3, 0], [4, 1], [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]]
N = 10
result = ordered(data, N)  # named `result` to avoid shadowing the built-in ord()
print(result)

Output:

[[0, 7], [7, 4], [4, 1], [1, 8], [8, 5], [5, 2], [2, 9], [9, 6], [6, 3], [3, 0]]
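
Note that the DFS above returns pairs of values rather than indices into data, and its recursion depth grows with the loop length, so a long loop may run into Python's default recursion limit. The following is a minimal iterative sketch, not the answerer's method, that returns one index array per loop. It relies on the question's assumption that every integer from 0 to N-1 appears exactly twice; the names group_rows_by_walking and rows_of are hypothetical.

```python
import numpy as np

def group_rows_by_walking(data):
    """Return a list of index arrays, one per closed loop, in loop order.

    Assumes every integer 0..N-1 appears exactly twice in the Nx2 array.
    """
    data = np.asarray(data)
    n = len(data)
    # After sorting the flattened values, positions 2v and 2v+1 hold value v,
    # so rows_of[v] is the pair of row indices in which value v appears.
    order = np.argsort(data.ravel(), kind="stable")
    rows_of = (order // 2).reshape(n, 2)

    visited = np.zeros(n, dtype=bool)
    loops = []
    for start in range(n):
        if visited[start]:
            continue
        loop = []
        row = start
        target = data[start, 1]  # value the next row must contain
        while not visited[row]:
            visited[row] = True
            loop.append(row)
            r0, r1 = rows_of[target]
            nxt = r1 if r0 == row else r0
            # The new target is the element of the next row that is not the shared one.
            target = data[nxt, 0] if data[nxt, 1] == target else data[nxt, 1]
            row = nxt
        loops.append(np.array(loop))
    return loops

data = np.array([[0, 7], [1, 8], [2, 9], [3, 0], [4, 1],
                 [5, 2], [6, 3], [4, 7], [8, 5], [9, 6]])
loops = group_rows_by_walking(data)
print(loops)
```

On the question's example this reproduces the index order given in the example solution. The walk is still a Python-level loop, but each row is visited once and the next row is found by table lookup rather than by scanning the remaining rows.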

Answer 1: (score: 0)

Here is a way that cheats by using the scipy.sparse.csgraph module:

data = np.random.permutation(data)

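import scipy.sparse
import scipy.sparse.csgraph  # import the submodule explicitly

def find_closed_loops(data):
    # incidence[v, r] is 1 if value v appears in row r
    incidence = scipy.sparse.csr_matrix((np.ones(data.size), (data.flatten(), np.arange(data.size)//2)))
    # adjacency[r, s] is nonzero if rows r and s share a value
    adjacency = incidence.T * incidence
    n, labels = scipy.sparse.csgraph.connected_components(adjacency, return_labels=True)
    for l in range(n):
        # row indices of loop l, in ascending order (not in loop order)
        yield np.flatnonzero(labels == l)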

for idx in find_closed_loops(data):
    print(idx)
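
To see why the product incidence.T * incidence in the answer above can serve as an adjacency matrix between rows, here is a small dense NumPy illustration. It is a sketch with made-up data containing two loops of three rows each; dense arrays stand in for the sparse matrix purely for readability.

```python
import numpy as np

data = np.array([[0, 1], [1, 2], [2, 0],   # first loop: rows 0, 1, 2
                 [3, 4], [4, 5], [5, 3]])  # second loop: rows 3, 4, 5
n_rows = len(data)

# incidence[v, r] is 1 if value v appears in row r
# (there are as many distinct values as rows, so the matrix is square).
incidence = np.zeros((n_rows, n_rows), dtype=int)
incidence[data.ravel(), np.arange(data.size) // 2] = 1

# shared[r, s] counts the values that rows r and s have in common.
shared = incidence.T @ incidence
print(shared)
```

The diagonal entries are 2 (each row shares both of its values with itself), entries between rows of the same loop are nonzero wherever two rows share a value, and entries between rows of different loops are 0. The connected components of this matrix are therefore the loops.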