Question

Given a 2D numpy array:

00111100110111
01110011000110
00111110001000
01101101001110

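Is there an efficient way to replace runs of 1s whose length is >= N?

For example, if N=3, the output is:

00222200110222
02220011000110
00222220001000
01101101002220

In reality the 2D array is binary and I want to replace the runs of 1s with 0s, but for clarity I replace them with 2 in the example above.

Runnable example: http://runnable.com/U6q0q-TFWzxVd_Uf/numpy-replace-runs-for-python

Update: I know I changed the example to a version that did not handle the corner cases. That was a minor implementation bug (now fixed). I am more interested in whether there is a faster way to do this.

The code I'm currently using looks a bit hacky, and I feel there is probably some magic numpy way of doing this: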

import numpy as np
import time

def replace_runs(a, search, run_length, replace = 2):
  a_copy = a.copy() # Don't modify original
  for i, row in enumerate(a):
    runs = []
    current_run = []
    for j, val in enumerate(row):
      if val == search:
        current_run.append(j)
      else:
        if len(current_run) >= run_length or j == len(row) -1:
          runs.append(current_run)
        current_run = []

    if len(current_run) >= run_length or j == len(row) -1:
      runs.append(current_run)

    for run in runs:
      for col in run:
        a_copy[i][col] = replace

  return a_copy

arr = np.array([
  [0,0,1,1,1,1,0,0,1,1,0,1,1,1],
  [0,1,1,1,0,0,1,1,0,0,0,1,1,0],
  [0,0,1,1,1,1,1,0,0,0,1,0,0,0],
  [0,1,1,0,1,1,0,1,0,0,1,1,1,0],
  [1,1,1,1,1,1,1,1,1,1,1,1,1,1],
  [0,0,0,0,0,0,0,0,0,0,0,0,0,0],
  [1,1,1,1,1,1,1,1,1,1,1,1,1,0],
  [0,1,1,1,1,1,1,1,1,1,1,1,1,1],
])

print(arr)
print(replace_runs(arr, 1, 3))

iterations = 100000

t0 = time.time()
for i in range(0,iterations):
  replace_runs(arr, 1, 3)
t1 = time.time()

print("replace_runs: %d iterations took %.3fs" % (iterations, t1 - t0))

Answer 1

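I assume the input is a one-dimensional array, since the approach generalizes to two dimensions.

In binary, you can check whether two items are both 1 by using &. In numpy, you can "shift" an array efficiently by slicing. So, create a second array that has a 1 in every position you want to unset (or change to two). Then ^ or + it into the original, depending on whether you want to turn those positions into zeros or twos: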

def unset_ones(a, n):
    match = a[:-n].copy()
    for i in range(1, n): # find 1s that have n-1 1s following
        match &= a[i:i-n]
    matchall = match.copy()
    matchall.resize(match.size + n)
    for i in range(1, n): # make the following n-1 1s as well
        matchall[i:i-n] |= match
    b = a.copy()
    b ^= matchall # xor into the original data; replace by + to make 2s
    return b

Example:

>>> unset_ones(np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]), 3)
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
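The same shift-and-AND idea can be sketched for the 2D case. This is only an illustration under some assumptions: the function name mark_runs_2d is hypothetical, the input is taken to contain only 0s and 1s, and the window slices are chosen so that a run ending at the last column is also found, which the 1D version above does not cover.

```python
import numpy as np

def mark_runs_2d(a, n):
    """Boolean mask that is True wherever a row of the 0/1 array `a`
    contains a run of at least n consecutive 1s (hypothetical helper)."""
    a = np.asarray(a, dtype=bool)
    rows, cols = a.shape
    if n > cols:
        return np.zeros_like(a)
    width = cols - n + 1  # number of length-n windows per row
    # starts[:, k] is True when a[:, k:k+n] is all ones
    starts = a[:, :width].copy()
    for i in range(1, n):
        starts &= a[:, i:i + width]
    # spread every window start over the n cells it covers
    mask = np.zeros_like(a)
    for i in range(n):
        mask[:, i:i + width] |= starts
    return mask

arr = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
])
out = arr.copy()
out[mark_runs_2d(arr, 3)] = 2  # use 0 instead of 2 to unset the runs
print(out)
```

Only the two short loops over n remain in Python, so the cost per call grows with n rather than with the number of rows.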

Answer 2

Using pattern matching via convolution:

def replace_runs(a, N, replace = 2):
    a_copy = a.copy()
    pattern = np.ones(N, dtype=int)
    M = a_copy.shape[1]

    for i, row in enumerate(a_copy):
        conv = np.convolve(row, pattern, mode='same')
        match = np.where(conv==N)

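        a_copy[i][match]=replace
        # also mark the left and right neighbour of each window centre
        # (this +-1 spread is specific to N == 3)
        a_copy[i][match[0][match[0]-1>=0]-1]=replace
        a_copy[i][match[0][match[0]+1<M]+1]=replace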
    return a_copy

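It is about 3 times slower than the original replace_runs, but it detects the corner cases (like the suggested string-based approach).

On my machine:

replace_runs_org: 100000 iterations took 12.792s

replace_runs_var: 100000 iterations took 33.112s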
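The convolution idea can be extended to an arbitrary run length by convolving twice: once to find the windows that are full of the searched value, and once more to spread each hit over the cells of its window. The sketch below is one possible way to do that, not the answer's own code; the name replace_runs_conv and the search parameter are assumptions added for illustration.

```python
import numpy as np

def replace_runs_conv(a, n, search=1, replace=2):
    """Replace runs of `search` of length >= n in each row (hypothetical sketch)."""
    out = a.copy()
    if n > a.shape[1]:
        return out  # no window of length n fits into a row
    kernel = np.ones(n, dtype=int)
    for i, row in enumerate(a):
        ones = (row == search).astype(int)
        # hits[k] == 1 when row[k:k+n] consists only of `search`
        hits = (np.convolve(ones, kernel, mode='valid') == n).astype(int)
        # a 'full' convolution spreads every hit over the n cells it covers
        covered = np.convolve(hits, kernel, mode='full') > 0
        out[i, covered] = replace
    return out

arr = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
])
result = replace_runs_conv(arr, 3)
print(result)
```

Because the 'valid' convolution only produces complete windows, runs touching the first or last column need no special handling.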

Answer 3

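First of all, your code does not work properly... it is replacing with 2s the cluster of only two 1s at the end of the second row. That said, the following does what your text describes: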

def replace_runs_bis(arr, search=1, n=3, val=2):
    ret = np.array(arr) # this makes a copy by default
    rows, cols = arr.shape
    # Fast convolution with an all 1's kernel
    arr_cum = np.cumsum(arr == search, axis=1)
    arr_win = np.empty((rows, cols-n+1), dtype=np.intp)
    arr_win[:, 0] = arr_cum[:, n-1]
    arr_win[:, 1:] = arr_cum[:, n:] - arr_cum[:, :-n]
    mask_win = arr_win >= n
    # mask_win is True for n-item windows full of search values; expand to pixels
    mask = np.zeros_like(arr, dtype=bool)
    for j in range(n):
        sl_end = -n+j+1
        sl_end = sl_end if sl_end else None
        mask[:, j:sl_end] |= mask_win
    #replace values
    ret[mask] = val

    return ret

For your sample array it is about 2x faster, but I would guess it gets much faster for larger arrays, as long as n stays small.

In [23]: %timeit replace_runs(arr, 1, 3)
10000 loops, best of 3: 163 µs per loop

In [24]: %timeit replace_runs_bis(arr, 1, 3)
10000 loops, best of 3: 80.9 µs per loop
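The "fast convolution" step in replace_runs_bis relies on the fact that the sum over a window equals the difference of two cumulative sums. A small check on the first row of the question's example, comparing against a brute-force window sum, may make this easier to follow (the variable names here are only illustrative):

```python
import numpy as np

row = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1])
n = 3

# cum[k] counts the 1s in row[:k+1]
cum = np.cumsum(row == 1)

# win[k] counts the 1s in row[k:k+n]
win = np.empty(row.size - n + 1, dtype=np.intp)
win[0] = cum[n - 1]
win[1:] = cum[n:] - cum[:-n]

# brute-force reference: sum every window explicitly
direct = np.array([row[k:k + n].sum() for k in range(row.size - n + 1)])

print(win)  # [1 2 3 3 2 1 1 2 2 2 2 3]
print(np.array_equal(win, direct))  # True
```

Windows whose count equals n (positions 2, 3 and 11 here) are exactly the ones that mask_win marks.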

Answer 4

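Pure Python

You may want to test your code; it does not seem to do what you expect. Please run this script, which tests your code against mine, and check the output: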

import numpy as np

def find_first(a, index, value):
    while index<a.size and a[index]!=value:
        index += 1
    return index

def find_end(a, index, value):
    while index<a.size and a[index]==value:
        index += 1
    return index

def replace_run(a, begin, end, threshold, replace):
    if end-begin+1 > threshold:
        a[begin:end] = replace

def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

def replace_py(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat

################################################################################
# Your code as posted in the question:

def replace_runs(a, search, run_length, replace = 2):
  a_copy = a.copy() # Don't modify original
  for i, row in enumerate(a):
    runs = []
    current_run = []
    for j, val in enumerate(row):
      if val == search:
        current_run.append(j)
      else:
        if len(current_run) >= run_length or j == len(row) -1:
          runs.append(current_run)
        current_run = []

    if len(current_run) >= run_length or j == len(row) -1:
      runs.append(current_run)

    for run in runs:
      for col in run:
        a_copy[i][col] = replace

  return a_copy

# End of your code
################################################################################

def print_mismatch(a, b):
    print('Elementwise equals')
    mat_equals = a==b
    print(mat_equals)
    print('Reduced to rows')
    for i, outcome in enumerate(np.logical_and.reduce(mat_equals, 1)):
        print(i, outcome)

if __name__=='__main__':
    np.random.seed(31)
    shape = (20, 10)
    mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
    mat.reshape(shape)
    runs = replace_runs(mat, 1, 3, 2)
    py = replace_py(mat, 1, 3, 2)

    print('Original')
    print(mat)
    print('replace_runs()')
    print(runs)
    print('replace_py()')
    print(py)

    print('Mismatch between replace_runs() and replace_py()')
    print_mismatch(runs, py)

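Until your code is fixed, benchmarking it is pointless. So I will use my replace_py() function for the benchmarks.

replace_py() implements what I think you are trying to do. It is not pythonic and contains many anti-patterns. Still, it seems to be correct.

Timing: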
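The mismatch can be reproduced on a single row without random data. In the sketch below, replace_runs_question is the question's function copied with its behaviour unchanged, and replace_runs_length_only is a hypothetical variant that drops the `j == len(row) - 1` clause so that only the run length decides; neither name comes from the original posts.

```python
import numpy as np

def replace_runs_question(a, search, run_length, replace=2):
    # Copied from the question, behaviour unchanged
    a_copy = a.copy()
    for i, row in enumerate(a):
        runs = []
        current_run = []
        for j, val in enumerate(row):
            if val == search:
                current_run.append(j)
            else:
                if len(current_run) >= run_length or j == len(row) - 1:
                    runs.append(current_run)
                current_run = []
        if len(current_run) >= run_length or j == len(row) - 1:
            runs.append(current_run)
        for run in runs:
            for col in run:
                a_copy[i][col] = replace
    return a_copy

def replace_runs_length_only(a, search, run_length, replace=2):
    # Hypothetical variant: a run is replaced only if it is long enough
    a_copy = a.copy()
    for i, row in enumerate(a):
        current_run = []
        for j, val in enumerate(row):
            if val == search:
                current_run.append(j)
                continue
            if len(current_run) >= run_length:
                a_copy[i, current_run] = replace
            current_run = []
        if len(current_run) >= run_length:
            a_copy[i, current_run] = replace
    return a_copy

# Second row of the question's example: it ends with the short run 1, 1
row = np.array([[0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]])
print(replace_runs_question(row, 1, 3)[0])     # trailing pair becomes 2, 2
print(replace_runs_length_only(row, 1, 3)[0])  # trailing pair stays 1, 1
```

The extra clause fires whenever the loop reaches the last column, regardless of how long the pending run is, which is why short runs near the end of a row get replaced.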

np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_py(mat, 1, 3, 2)
1 loops, best of 3: 9.49 s per loop

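Cython

I don't think your problem can easily be rewritten to use Numpy and vectorization. Maybe a Numpy guru can do it, but I fear the code would be either obscure or slow (or both). To quote one of the Numpy developers:

[...] when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython [...]

So I rewrote replace_py() and the functions it calls in Cython, using typed memoryviews: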

# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline int find_first(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]!=value:
        index += 1
    return index

cdef inline int find_end(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]==value:
        index += 1
    return index

cdef inline void replace_run(int[:] a, int begin, int end, int threshold, int replace) nogil:
    if end-begin+1 > threshold:
        for i in xrange(begin, end):
            a[i] = replace

cdef inline void process_row(int[:] a, int value, int threshold, int replace) nogil:
    cdef int first, end, n
    first = 0
    n = a.shape[0]
    while first < n:
        if a[first]==value:
            end = find_end(a, first, n, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, n, value)

def replace_cy(np.ndarray[np.int32_t, ndim=2] a, int value, int length, int replace):
    cdef int[:, ::1] vmat
    cdef int i, n
    mat = a.copy()
    vmat = mat
    n = vmat.shape[0]
    for i in xrange(n):
        process_row(vmat[i], value, length, replace)
    return mat

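It needed a little massaging, and the code is messier than the corresponding Python code given above. But it was not much work and it was quite straightforward.

Timing: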

np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_cy(mat, 1, 3, 2)
100 loops, best of 3: 8.16 ms per loop

A speedup of 1163x!

Numba

I have received help on Github, and now the Numba version works as well; I just added @autojit to the pure Python code, except for a[begin:end] = replace. See the discussion on Github, where I got the workaround.

import numpy as np
from numba import autojit

@autojit
def find_first(a, index, value):
    while index<a.size and a[index]!=value:
        index += 1
    return index

@autojit
def find_end(a, index, value):
    while index<a.size and a[index]==value:
        index += 1
    return index

@autojit
def replace_run(a, begin, end, threshold, replace):
    if end-begin+1 > threshold:
        for i in xrange(begin, end):
            a[i] = replace

@autojit
def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

@autojit
def replace_numba(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat

Timing (as above, same input, code omitted):

1 loops, best of 3: 86.5 ms per loop

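A speedup of 110x compared with the pure Python code, basically for free! The Numba version is still 10 times slower than Cython, most likely due to {{3}}, but I think it is amazing to get this speedup basically for free, without messing up our Python code!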

Answer 5

This is slightly faster than the OP's code, but still hacky:

import itertools

def replace2(originalM) :
    m = originalM.copy()
    for v in m :
        idx = 0
        for (key,n) in ( (key, sum(1 for _ in group)) for (key,group) in itertools.groupby(v) ) :
            if key and n>=3 :
                v[idx:idx+n] = 2
            idx += n
    return m

%%timeit
replace_runs(arr, 1, 3)
10000 loops, best of 3: 61.8 µs per loop

%%timeit
replace2(arr)
10000 loops, best of 3: 48 µs per loop
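replace2 hard-codes the run length 3 and the replacement value 2. A parameterized variant of the same groupby idea could look like the sketch below; the function name and the search, run_length and replace parameters are assumptions added for illustration, and each row is converted to a list first so that groupby does not iterate over an array that is being modified.

```python
from itertools import groupby

import numpy as np

def replace_runs_groupby(a, search=1, run_length=3, replace=2):
    """Replace runs of `search` of length >= run_length in each row (sketch)."""
    out = a.copy()
    for row in out:  # each row is a view, so the assignment below writes into `out`
        idx = 0
        for key, group in groupby(row.tolist()):
            n = sum(1 for _ in group)  # length of the current run
            if key == search and n >= run_length:
                row[idx:idx + n] = replace
            idx += n
    return out

arr = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
])
result = replace_runs_groupby(arr)
print(result)
```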

Answer 6

The convolution approach by toine is a good one too. Based on {{3}}, you can use these answers to get what you want.

from itertools import groupby, repeat, chain
run_length = 3
new_value = 2
# Groups the element by successive repetition
grouped = [(k, sum(1 for _ in v)) for k, v in groupby(arr[0])]
# [(0, 2), (1, 4), (0, 2), (1, 2), (0, 1), (1, 3)]
output = list(chain(*[list(repeat(k if v < run_length else new_value, v)) for k, v in grouped]))
# [0, 0, 2, 2, 2, 2, 0, 0, 1, 1, 0, 2, 2, 2]

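You have to do this for every row in arr. If you want it to be really efficient, you will have to adapt it to your needs (for example, by removing the list creations).

Using the example given by Paul in the answer I linked, you could do something like: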

import numpy as np
new_value = 2
run_length = 3
# Difference between consecutive elements, padded with nonzero sentinels
# so that the first run start and the end of the row are detected
diff = np.concatenate(([2], np.diff(arr[0]), [-1]))
# Indices where a new run starts (plus the end-of-row sentinel)
idx_diff = np.where(diff)[0]
# Runs that are at least run_length long and whose value is 1
idx = np.where((np.diff(idx_diff) >= run_length) & arr[0][idx_diff[:-1]])[0]
# Set every such run to its new value
for i in idx:
    arr[0][idx_diff[i]:idx_diff[i+1]] = new_value

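This is just food for thought. With this approach it should be possible to process the whole matrix in one pass and to modify the array in place, which should be efficient. Sorry for the raw state of this idea. I hope it gives you some insight. A good hint for speeding it up is to remove the for loop.

That is, of course, if you want to sacrifice clarity for speed. In my opinion this is rarely what you want in Python, where you usually want to prototype ideas quickly. If you have an algorithm that has to be fast, write it in C (or with Cython) and use it from your Python program (using ctypes or CFFI).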
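One way the loop-free, whole-matrix idea might be realized is with a run-length encoding of the flattened array, forcing a run boundary at the start of every row and expanding the per-run decision back with np.repeat. This is a sketch under those assumptions, not the code the answer had in mind; replace_runs_rle and its parameters are hypothetical names, and it returns a copy rather than modifying the input in place.

```python
import numpy as np

def replace_runs_rle(a, search=1, run_length=3, replace=2):
    """Replace runs of `search` of length >= run_length, row by row, without Python loops."""
    out = a.copy()
    if a.size == 0:
        return out
    rows, cols = a.shape
    flat = (a == search).ravel()
    # a run starts where the value changes or where a new row begins
    change = np.empty(flat.size, dtype=bool)
    change[0] = True
    change[1:] = flat[1:] != flat[:-1]
    change[::cols] = True
    starts = np.flatnonzero(change)
    lengths = np.diff(np.append(starts, flat.size))
    # keep only runs of the searched value that are long enough
    long_run = flat[starts] & (lengths >= run_length)
    # expand the per-run decision back to one flag per cell
    mask = np.repeat(long_run, lengths).reshape(rows, cols)
    out[mask] = replace
    return out

arr = np.array([
    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
])
result = replace_runs_rle(arr)
print(result)
```

Forcing a boundary at every row start keeps a run at the end of one row from merging with a run at the beginning of the next.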

Remove runs from a 2D numpy array
