Removing runs from a 2D numpy array

Time: 2014-06-25 11:27:22

Tags: python arrays algorithm numpy

Given a 2D numpy array:

00111100110111
01110011000110
00111110001000
01101101001110

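Is there an efficient way to replace runs of 1 which are >= N in length?

For example, if N=3:

00222200110222
02220011000110
00222220001000
01101101002220

In reality the 2D array is binary and I want to replace runs of 1 with 0, but for clarity I replace them with 2 in the example above.

Runnable example: http://runnable.com/U6q0q-TFWzxVd_Uf/numpy-replace-runs-for-python

UPDATE: I'm aware that I changed the example to a version that didn't handle corner cases. That was a minor implementation bug (now fixed). I'm more interested in whether there is a faster way of doing it.

The code I'm currently using looks a bit hacky, and I feel like there is probably some magic numpy way of doing this:
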
import numpy as np
import time

def replace_runs(a, search, run_length, replace = 2):
  a_copy = a.copy() # Don't modify original
  for i, row in enumerate(a):
    runs = []
    current_run = []
    for j, val in enumerate(row):
      if val == search:
        current_run.append(j)
      else:
        if len(current_run) >= run_length or j == len(row) -1:
          runs.append(current_run)
        current_run = []

    if len(current_run) >= run_length or j == len(row) -1:
      runs.append(current_run)

    for run in runs:
      for col in run:
        a_copy[i][col] = replace

  return a_copy

arr = np.array([
  [0,0,1,1,1,1,0,0,1,1,0,1,1,1],
  [0,1,1,1,0,0,1,1,0,0,0,1,1,0],
  [0,0,1,1,1,1,1,0,0,0,1,0,0,0],
  [0,1,1,0,1,1,0,1,0,0,1,1,1,0],
  [1,1,1,1,1,1,1,1,1,1,1,1,1,1],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [1,1,1,1,1,1,1,1,1,1,1,1,1,0],
  [0,1,1,1,1,1,1,1,1,1,1,1,1,1],
])

print(arr)
print(replace_runs(arr, 1, 3))

iterations = 100000

t0 = time.time()
for i in range(0,iterations):
  replace_runs(arr, 1, 3)
t1 = time.time()

print("replace_runs: %d iterations took %.3fs" % (iterations, t1 - t0))

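6 answers:

Answer 0 (score: 1)

I'm treating the input as a one-dimensional array, since the approach generalizes to two dimensions.

In binary, you can check whether two items are both 1 by using &. In numpy, you can "shift" an array efficiently by slicing. So, create a second array that has a 1 in every position you want to unset (or change to two). Then ^ or + that into the original, depending on whether you want to turn those positions into zeros or twos: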

def unset_ones(a, n):
    match = a[:-n].copy()
    for i in range(1, n): # find 1s that have n-1 1s following
        match &= a[i:i-n]
    matchall = match.copy()
    matchall.resize(match.size + n)
    for i in range(1, n): # make the following n-1 1s as well
        matchall[i:i-n] |= match
    b = a.copy()
    b ^= matchall # xor into the original data; replace by + to make 2s
    return b

Example:

>>> unset_ones(np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]), 3)
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
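
The same shift-and-AND idea can be applied to all rows at once by slicing along the column axis. The following is only a sketch under a few assumptions: the name unset_ones_2d and its replace parameter are made up for illustration, the input is a 2D array of 0s and 1s, and, unlike unset_ones above, the window that ends at the last column is included, so no trailing padding is needed.

```python
import numpy as np

def unset_ones_2d(a, n, replace=0):
    """Replace every horizontal run of at least n ones in a 2D 0/1 array."""
    a = np.asarray(a)
    rows, cols = a.shape
    if n > cols:
        return a.copy()  # no run of length n fits in a row
    ones = a == 1
    width = cols - n + 1  # number of possible window start columns
    # starts[:, k] is True when columns k .. k+n-1 are all ones
    starts = ones[:, :width].copy()
    for i in range(1, n):
        starts &= ones[:, i:i + width]
    # Spread every window start over the n columns it covers
    mask = np.zeros(a.shape, dtype=bool)
    for i in range(n):
        mask[:, i:i + width] |= starts
    out = a.copy()
    out[mask] = replace
    return out

sample = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
])
print(unset_ones_2d(sample, 3, replace=2))
```

Both loops run n times regardless of the array size, so the cost is dominated by whole-array slice operations as long as n stays small.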

Answer 1 (score: 1)

Pattern matching via convolution:

def replace_runs(a, N, replace = 2):
    a_copy = a.copy()
    pattern = np.ones(N, dtype=int)
    M = a_copy.shape[1]

    for i, row in enumerate(a_copy):
        conv = np.convolve(row, pattern, mode='same')
        match = np.where(conv==N)

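        a_copy[i][match]=replace
        # Also mark the left and right neighbours of each window centre
        # (this step assumes N == 3); >=0 so that column 0 is included
        a_copy[i][match[0][match[0]-1>=0]-1]=replace
        a_copy[i][match[0][match[0]+1<M]+1]=replace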
    return a_copy

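Roughly 2.6 times slower than the original replace_runs, but it detects the corner cases (like the proposed string-based approach).

On my machine:

replace_runs_org: 100000 iterations took 12.792s

replace_runs_var: 100000 iterations took 33.112s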
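
For run lengths other than 3, one possible variation is to convolve twice: once in 'valid' mode to find the windows that are completely filled with the search value, and once in 'full' mode to spread each such window back over the N columns it covers. This is a sketch rather than the answer's code; the function name replace_runs_conv and its search parameter are assumptions made for illustration.

```python
import numpy as np

def replace_runs_conv(a, N, search=1, replace=2):
    """Replace horizontal runs of `search` that are at least N long."""
    a = np.asarray(a)
    out = a.copy()
    kernel = np.ones(N, dtype=int)
    for i, row in enumerate(a):
        if row.size < N:
            continue  # a run of length N cannot fit in this row
        hits = (row == search).astype(int)
        # window_sums[k] counts the hits in columns k .. k+N-1
        window_sums = np.convolve(hits, kernel, mode='valid')
        full_windows = (window_sums == N).astype(int)
        # 'full' mode spreads every full window over the N columns it covers
        covered = np.convolve(full_windows, kernel, mode='full') > 0
        out[i, covered] = replace
    return out

row = np.array([[1, 1, 1, 0, 1, 1, 1, 1]])
print(replace_runs_conv(row, 4))
```

The 'full' convolution of an array of length M-N+1 with a kernel of length N has length M again, so the resulting boolean mask lines up with the row without any index arithmetic.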

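Answer 2 (score: 1)

First off, your code is not working properly: it replaces with 2s the cluster at the end of the second row, which contains only two 1s. That said, the following does what your text describes: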

def replace_runs_bis(arr, search=1, n=3, val=2):
    ret = np.array(arr) # this makes a copy by default
    rows, cols = arr.shape
    # Fast convolution with an all 1's kernel
    arr_cum = np.cumsum(arr == search, axis=1)
    arr_win = np.empty((rows, cols-n+1), dtype=np.intp)
    arr_win[:, 0] = arr_cum[:, n-1]
    arr_win[:, 1:] = arr_cum[:, n:] - arr_cum[:, :-n]
    mask_win = arr_win >= n
    # mask_win is True for n item windows all full of searchs, expand to pixels
    mask = np.zeros_like(arr, dtype=bool)
    for j in range(n):
        sl_end = -n+j+1
        sl_end = sl_end if sl_end else None
        mask[:, j:sl_end] |= mask_win
    #replace values
    ret[mask] = val

    return ret

For your sample array it is about 2x faster, but I'd guess it will be much faster for larger arrays, as long as n stays small.

In [23]: %timeit replace_runs(arr, 1, 3)
10000 loops, best of 3: 163 µs per loop

In [24]: %timeit replace_runs_bis(arr, 1, 3)
10000 loops, best of 3: 80.9 µs per loop
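
The "fast convolution with an all 1's kernel" step in replace_runs_bis may be easier to follow on a single short row. The sketch below uses a made-up example row and compares the cumulative-sum window counts with np.convolve; the variable names are chosen for illustration only.

```python
import numpy as np

row = np.array([0, 1, 1, 1, 0, 1, 1])
n = 3

# csum[k] is the number of ones in row[0 .. k]
csum = np.cumsum(row == 1)

# window[k] is the number of ones in row[k .. k+n-1]
window = np.empty(row.size - n + 1, dtype=np.intp)
window[0] = csum[n - 1]               # first window: everything up to column n-1
window[1:] = csum[n:] - csum[:-n]     # later windows: difference of two prefix sums

# The same counts obtained with an explicit convolution
reference = np.convolve(row, np.ones(n, dtype=int), mode='valid')

print(window)
print(reference)
```

Each window count costs one subtraction regardless of n, which is why the cumulative-sum version avoids the per-window work of a direct convolution.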

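Answer 3 (score: 1)

Pure Python

You might want to test your code; it doesn't seem to do what you expect. Please run this script, test your code against mine, and check the output: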

import numpy as np

def find_first(a, index, value):
    while index<a.size and a[index]!=value:
        index += 1
    return index

def find_end(a, index, value):
    while index<a.size and a[index]==value:
        index += 1
    return index

def replace_run(a, begin, end, threshold, replace):
    if end-begin+1 > threshold:
        a[begin:end] = replace

def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

def replace_py(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat

################################################################################
# Your code as posted in the question:

def replace_runs(a, search, run_length, replace = 2):
  a_copy = a.copy() # Don't modify original
  for i, row in enumerate(a):
    runs = []
    current_run = []
    for j, val in enumerate(row):
      if val == search:
        current_run.append(j)
      else:
        if len(current_run) >= run_length or j == len(row) -1:
          runs.append(current_run)
        current_run = []

    if len(current_run) >= run_length or j == len(row) -1:
      runs.append(current_run)

    for run in runs:
      for col in run:
        a_copy[i][col] = replace

  return a_copy

# End of your code
################################################################################

def print_mismatch(a, b):
    print('Elementwise equals')
    mat_equals = a==b
    print(mat_equals)
    print('Reduced to rows')
    for i, outcome in enumerate(np.logical_and.reduce(mat_equals, 1)):
        print(i, outcome)

if __name__=='__main__':
    np.random.seed(31)
    shape = (20, 10)
    mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
    mat.reshape(shape)
    runs = replace_runs(mat, 1, 3, 2)
    py = replace_py(mat, 1, 3, 2)

    print('Original')
    print(mat)
    print('replace_runs()')
    print(runs)
    print('replace_py()')
    print(py)

    print('Mismatch between replace_runs() and replace_py()')
    print_mismatch(runs, py)

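As long as your code is not fixed, benchmarking it makes no sense. So I will use my replace_py() function for the benchmarks.

replace_py() implements what I think you intended. It is not pythonic and it contains many anti-patterns. Still, it seems to be correct.

Timing: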

np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_py(mat, 1, 3, 2)
1 loops, best of 3: 9.49 s per loop

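With Cython

I don't think your problem can easily be rewritten to use Numpy and vectorization. Maybe a Numpy guru could do it, but I'm afraid the code would become either obscure or slow (or both). To quote one of the Numpy developers:

  [...] when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython [...]

So I rewrote replace_py() and the functions it calls in Cython, using typed memoryviews: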

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline int find_first(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]!=value:
        index += 1
    return index

cdef inline int find_end(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]==value:
        index += 1
    return index

cdef inline void replace_run(int[:] a, int begin, int end, int threshold, int replace) nogil:
    if end-begin+1 > threshold:
        for i in xrange(begin, end):
            a[i] = replace

cdef inline void process_row(int[:] a, int value, int threshold, int replace) nogil:
    cdef int first, end, n
    first = 0
    n = a.shape[0]
    while first < n:
        if a[first]==value:
            end = find_end(a, first, n, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, n, value)

def replace_cy(np.ndarray[np.int32_t, ndim=2] a, int value, int length, int replace):
    cdef int[:, ::1] vmat
    cdef int i, n
    mat = a.copy()
    vmat = mat
    n = vmat.shape[0]
    for i in xrange(n):
        process_row(vmat[i], value, length, replace)
    return mat

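It needed a bit of massaging, and the code is more cluttered than the corresponding Python code given above. But it was not too much work, and it was quite straightforward.

Timing: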

np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_cy(mat, 1, 3, 2)
100 loops, best of 3: 8.16 ms per loop

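A speedup of a factor of 1163!

Numba

I have received help on Github, and now the Numba version is working as well; I just added @autojit to the pure Python code, except for a[begin:end] = replace; see the discussion on Github, where I got this workaround.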

import numpy as np
from numba import autojit

@autojit
def find_first(a, index, value):
    while index<a.size and a[index]!=value:
        index += 1
    return index

@autojit
def find_end(a, index, value):
    while index<a.size and a[index]==value:
        index += 1
    return index

@autojit
def replace_run(a, begin, end, threshold, replace):
    if end-begin+1 > threshold:
        for i in xrange(begin, end):
            a[i] = replace

@autojit        
def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

@autojit            
def replace_numba(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat

Timing (as above, same input, code omitted):

1 loops, best of 3: 86.5 ms per loop

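A speedup of a factor of 110 compared to the pure Python code, basically for free! The Numba version is still about 10 times slower than Cython, most likely due to [...], but I think it is amazing to get this speedup basically for free, without cluttering our Python code!

Answer 4 (score: 0)

This is slightly faster than the OP's, but still hacky: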

import itertools

def replace2(originalM) :
    m = originalM.copy()
    for v in m :
        idx = 0
        for (key,n) in ( (key, sum(1 for _ in group)) for (key,group) in itertools.groupby(v) ) :
            if key and n>=3 :
                v[idx:idx+n] = 2
            idx += n
    return m

%%timeit
replace_runs(arr, 1, 3)
10000 loops, best of 3: 61.8 µs per loop

%%timeit
replace2(arr)
10000 loops, best of 3: 48 µs per loop

Answer 5 (score: 0)

The convolution approach by toine is a good way too. Based on these answers, you can use itertools.groupby to get what you want.

from itertools import groupby, repeat, chain
run_length = 3
new_value = 2
# Groups the element by successive repetition
grouped = [(k, sum(1 for _ in v)) for k, v in groupby(arr[0])]
# [(0, 2), (1, 4), (0, 2), (1, 2), (0, 1), (1, 3)]
# Only runs of 1s are replaced; long runs of 0s keep their value
output = list(chain(*[list(repeat(new_value if k == 1 and v >= run_length else k, v)) for k, v in grouped]))
# [0, 0, 2, 2, 2, 2, 0, 0, 1, 1, 0, 2, 2, 2]

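You have to do this for every row in arr. If you want it to be really efficient, you will have to adapt it to your needs (e.g. remove the list creations).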
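
Applied to every row, and writing into a copy of the array instead of building lists, the groupby idea might look like the sketch below. The function name replace_runs_groupby and its parameters are hypothetical, chosen for illustration.

```python
from itertools import groupby

import numpy as np

def replace_runs_groupby(arr, search=1, run_length=3, new_value=2):
    """Replace runs of `search` that are at least run_length long, row by row."""
    out = np.array(arr)  # np.array copies by default, so arr is left untouched
    for row in out:      # each row is a view into out
        start = 0
        # Group a snapshot of the row, since the row itself is modified below
        for value, group in groupby(row.tolist()):
            length = sum(1 for _ in group)
            if value == search and length >= run_length:
                row[start:start + length] = new_value
            start += length
    return out

sample = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
])
print(replace_runs_groupby(sample))
```

Tracking the start column while iterating over the groups replaces the chain/repeat step with a slice assignment per qualifying run.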

Using the example given by Paul in the answer I linked, you could do something like:

import numpy as np
new_value = 2
run_length = 3
# Pad so that the first run always registers as a change and the end of the row is marked
diff = np.concatenate(([2], np.diff(arr[0]), [-1]))
# Indices where the value changes, i.e. where each run starts (plus the end of the row)
idx_diff = np.where(diff)[0]
# Runs that are at least run_length long and whose value is 1
idx = np.where((np.diff(idx_diff) >= run_length) & arr[0][idx_diff[:-1]])[0]
# Set every such run to its new value
for i in idx:
    arr[0][idx_diff[i]:idx_diff[i+1]] = new_value

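This is just food for thought. With this approach it should be possible to process the whole matrix in one pass and modify the array in place, which should be efficient. Sorry for the raw state of this idea. I hope it gives you some insight. A good hint for speeding it up is to remove the for loop.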
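
One way the whole matrix could be handled without a Python loop is sketched below. It is an illustration under assumptions, not the answer's code: the name replace_runs_vectorized and its parameters are made up, and it returns a modified copy rather than changing the array in place. Run starts and ends are found with np.diff along the columns, and the long runs are then filled through a difference array and a cumulative sum.

```python
import numpy as np

def replace_runs_vectorized(arr, search=1, run_length=3, new_value=2):
    """Replace horizontal runs of `search` of at least run_length, without a Python loop."""
    arr = np.asarray(arr)
    rows, cols = arr.shape
    hit = (arr == search).astype(np.int8)
    # A zero column on each side gives runs touching an edge a start and an end
    padded = np.pad(hit, ((0, 0), (1, 1)), mode='constant')
    step = np.diff(padded, axis=1)              # shape (rows, cols + 1)
    start_r, start_c = np.nonzero(step == 1)    # first column of each run
    end_r, end_c = np.nonzero(step == -1)       # one past the last column of each run
    # np.nonzero returns row-major order, so the k-th start pairs with the k-th end
    long_run = (end_c - start_c) >= run_length
    # Difference array: +1 where a long run starts, -1 just past its end
    delta = np.zeros((rows, cols + 1), dtype=np.int8)
    delta[start_r[long_run], start_c[long_run]] = 1
    delta[end_r[long_run], end_c[long_run]] = -1
    mask = np.cumsum(delta, axis=1)[:, :cols] > 0
    out = arr.copy()
    out[mask] = new_value
    return out

sample = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
])
print(replace_runs_vectorized(sample))
```

A run's end position always holds a non-matching value (or lies past the last column), so start and end markers never collide and plain index assignment into the difference array is enough.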

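This is, of course, only if you are willing to sacrifice clarity for speed. In my opinion that is rarely what you want in Python, where the goal is usually to prototype ideas quickly. If you have an algorithm that must be fast, write it in C (or with Cython) and use it from your Python program (using ctypes or CFFI).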