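Given a 2D numpy array:
00111100110111
01110011000110
00111110001000
01101101001110
Is there an efficient way to replace runs of 1 whose length is >= N?
For example, if N=3, the output would be:
00222200110222
02220011000110
00222220001000
01101101002220
In reality the 2D array is binary and I want to replace the runs of 1 with 0, but for clarity I replace them with 2 in the example above.
Runnable example: http://runnable.com/U6q0q-TFWzxVd_Uf/numpy-replace-runs-for-python
Update: I'm aware that I changed the example to a version that didn't handle corner cases. That was a minor implementation bug (now fixed). I'm more interested in whether there is a faster way to do this.
The code I'm currently using looks a bit hacky, and I feel there is probably some magic numpy way of doing this: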
import numpy as np
import time

def replace_runs(a, search, run_length, replace = 2):
    a_copy = a.copy() # Don't modify original
    for i, row in enumerate(a):
        runs = []
        current_run = []
        for j, val in enumerate(row):
            if val == search:
                current_run.append(j)
            else:
                # Note: the `j == len(row) - 1` test also accepts a short run
                # that sits right before the last element of the row
                if len(current_run) >= run_length or j == len(row) -1:
                    runs.append(current_run)
                current_run = []

        # Handle a run that is still open when the row ends
        if len(current_run) >= run_length or j == len(row) -1:
            runs.append(current_run)

        for run in runs:
            for col in run:
                a_copy[i][col] = replace

    return a_copy

arr = np.array([
    [0,0,1,1,1,1,0,0,1,1,0,1,1,1],
    [0,1,1,1,0,0,1,1,0,0,0,1,1,0],
    [0,0,1,1,1,1,1,0,0,0,1,0,0,0],
    [0,1,1,0,1,1,0,1,0,0,1,1,1,0],
    [1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    [0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [1,1,1,1,1,1,1,1,1,1,1,1,1,0],
    [0,1,1,1,1,1,1,1,1,1,1,1,1,1],
])

print(arr)
print(replace_runs(arr, 1, 3))

iterations = 100000

t0 = time.time()
for i in range(0,iterations):
    replace_runs(arr, 1, 3)
t1 = time.time()

print("replace_runs: %d iterations took %.3fs" % (iterations, t1 - t0))
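Answer 0 (score: 1)
I'll treat the input as a one-dimensional array, since the idea generalizes to two dimensions.
In binary, you can check whether two items are both 1 by using &. In numpy, you can "shift" an array efficiently by slicing. So, create a second array that has a 1 in every place you want to unset (or change to two). Then ^ or + that into the original, depending on whether you want to turn those places into zeros or twos: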
def unset_ones(a, n):
    # Note: only windows starting in a[:-n] are tested, so a run that reaches
    # the very last element is not detected unless the array ends in a 0
    match = a[:-n].copy()
    for i in range(1, n): # find 1s that have n-1 1s following
        match &= a[i:i-n]
    matchall = match.copy()
    matchall.resize(match.size + n)
    for i in range(1, n): # make the following n-1 1s as well
        matchall[i:i-n] |= match
    b = a.copy()
    b ^= matchall # xor into the original data; replace by + to make 2s
    return b
Example:
>>> unset_ones(np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]), 3)
array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])
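As a hedged variation on the same shift-and-AND idea, the sketch below keeps one entry per full window (so a run that touches the end of the array is also found) and works along the last axis of a 1D or 2D array. The function name `replace_runs_shift` and its parameters are made up for illustration, and binary input is assumed.

```python
import numpy as np

def replace_runs_shift(a, n, replace=2):
    """Replace runs of 1s of length >= n along the last axis (binary input assumed)."""
    a = np.asarray(a)
    cols = a.shape[-1]
    out = a.copy()
    if n > cols:
        return out  # no run can be long enough
    ones = a == 1
    n_windows = cols - n + 1
    # starts[..., k] is True when a[..., k:k+n] is all 1s (AND of n shifted slices)
    starts = ones[..., :n_windows].copy()
    for i in range(1, n):
        starts &= ones[..., i:i + n_windows]
    # Spread every window start over the n positions it covers (OR of n shifted slices)
    mask = np.zeros(a.shape, dtype=bool)
    for i in range(n):
        mask[..., i:i + n_windows] |= starts
    out[mask] = replace
    return out
```

Calling `replace_runs_shift(arr, 3)` on the question's array marks the runs with 2; passing `replace=0` unsets them instead.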
Answer 1 (score: 1)
Pattern matching using convolution:
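def replace_runs(a, N, replace = 2):
    a_copy = a.copy()
    pattern = np.ones(N, dtype=int)
    M = a_copy.shape[1]

    for i, row in enumerate(a_copy):
        # conv[j] counts the 1s in the window centred on j
        conv = np.convolve(row, pattern, mode='same')
        match = np.where(conv==N)

        a_copy[i][match]=replace
        # Also mark the left and right neighbours of each matching centre
        # (this +/-1 expansion covers the whole window only for N == 3);
        # >= 0 so that a run starting at column 0 is fully replaced
        a_copy[i][match[0][match[0]-1>=0]-1]=replace
        a_copy[i][match[0][match[0]+1<M]+1]=replace
    return a_copy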
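This is about 3x slower than the original replace_runs, but it detects the corner cases (as does the proposed string-based approach).
On my machine:
replace_runs_org: 100000 iterations took 12.792s
replace_runs_var: 100000 iterations took 33.112s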
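The convolution idea can be extended to any N by convolving twice: once in 'valid' mode to find the windows that are completely filled, and once in 'full' mode to spread each such window back over the cells it covers. This is only a sketch under the assumption of a 2D integer array; `replace_runs_conv` and its parameter names are illustrative, not part of the answer above.

```python
import numpy as np

def replace_runs_conv(a, n, search=1, replace=2):
    """Replace runs of `search` of length >= n in each row of a 2D array."""
    out = a.copy()
    kernel = np.ones(n, dtype=int)
    for i, row in enumerate(a):
        hits = (row == search).astype(int)
        if hits.size < n:
            continue  # row too short to contain a run of length n
        # window_full[k] is 1 when row[k:k+n] consists only of `search`
        window_full = (np.convolve(hits, kernel, mode='valid') == n).astype(int)
        # Convolving again in 'full' mode marks every cell covered by a full window
        covered = np.convolve(window_full, kernel, mode='full') > 0
        out[i, covered] = replace
    return out
```

The second convolution restores the original row length, since a 'valid' result of length M-n+1 convolved in 'full' mode with a kernel of length n has length M.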
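Answer 2 (score: 1)
First off, your code is not working properly... it is replacing with 2s a cluster of only two 1s at the end of the second row. That said, the following does what your text describes: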
def replace_runs_bis(arr, search=1, n=3, val=2):
    ret = np.array(arr) # this makes a copy by default
    rows, cols = arr.shape
    # Fast convolution with an all 1's kernel
    arr_cum = np.cumsum(arr == search, axis=1)
    arr_win = np.empty((rows, cols-n+1), dtype=np.intp)
    arr_win[:, 0] = arr_cum[:, n-1]
    arr_win[:, 1:] = arr_cum[:, n:] - arr_cum[:, :-n]
    mask_win = arr_win >= n
    # mask_win is True for n item windows all full of searchs, expand to pixels
    # (the builtin bool replaces np.bool, which newer NumPy releases removed)
    mask = np.zeros_like(arr, dtype=bool)
    for j in range(n):
        sl_end = -n+j+1
        sl_end = sl_end if sl_end else None
        mask[:, j:sl_end] |= mask_win
    #replace values
    ret[mask] = val

    return ret
For your sample array it is about 2x faster, but I'd guess it will be much faster for larger arrays, as long as n stays small.
In [23]: %timeit replace_runs(arr, 1, 3)
10000 loops, best of 3: 163 µs per loop
In [24]: %timeit replace_runs_bis(arr, 1, 3)
10000 loops, best of 3: 80.9 µs per loop
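For comparison, the window test can also be written with `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) instead of the cumulative-sum trick. This is a sketch, not a benchmarked replacement; `replace_runs_windows` is a hypothetical name and no timing claim is made for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def replace_runs_windows(arr, search=1, n=3, val=2):
    """Replace runs of `search` of length >= n in each row of a 2D array."""
    ret = arr.copy()
    if n > arr.shape[1]:
        return ret  # no window of length n fits in a row
    # windows[r, k] is True when arr[r, k:k+n] consists only of `search`
    windows = sliding_window_view(arr == search, n, axis=1).all(axis=-1)
    # Expand each full window to the n cells it covers
    mask = np.zeros(arr.shape, dtype=bool)
    for j in range(n):
        mask[:, j:j + windows.shape[1]] |= windows
    ret[mask] = val
    return ret
```

The view itself copies no data, but `.all(axis=-1)` still visits n elements per window, so the cumulative-sum version may well scale better for large n.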
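Answer 3 (score: 1)
You may want to test your code; it does not seem to do what you expect. Please run this script, which tests your code against mine, and check the output: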
import numpy as np

def find_first(a, index, value):
    # Index of the next element equal to value (a.size if there is none)
    while index<a.size and a[index]!=value:
        index += 1
    return index

def find_end(a, index, value):
    # Index just past the run of value that starts at index
    while index<a.size and a[index]==value:
        index += 1
    return index

def replace_run(a, begin, end, threshold, replace):
    # end is exclusive, so this replaces runs of length >= threshold
    if end-begin+1 > threshold:
        a[begin:end] = replace

def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

def replace_py(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat

################################################################################
# Your code as posted in the question:

def replace_runs(a, search, run_length, replace = 2):
    a_copy = a.copy() # Don't modify original
    for i, row in enumerate(a):
        runs = []
        current_run = []
        for j, val in enumerate(row):
            if val == search:
                current_run.append(j)
            else:
                if len(current_run) >= run_length or j == len(row) -1:
                    runs.append(current_run)
                current_run = []

        if len(current_run) >= run_length or j == len(row) -1:
            runs.append(current_run)

        for run in runs:
            for col in run:
                a_copy[i][col] = replace

    return a_copy

# End of your code
################################################################################

def print_mismatch(a, b):
    print('Elementwise equals')
    mat_equals = a==b
    print(mat_equals)
    print('Reduced to rows')
    for i, outcome in enumerate(np.logical_and.reduce(mat_equals, 1)):
        print(i, outcome)

if __name__=='__main__':
    np.random.seed(31)
    shape = (20, 10)
    mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
    mat.reshape(shape)

    runs = replace_runs(mat, 1, 3, 2)
    py = replace_py(mat, 1, 3, 2)

    print('Original')
    print(mat)
    print('replace_runs()')
    print(runs)
    print('replace_py()')
    print(py)

    print('Mismatch between replace_runs() and replace_py()')
    print_mismatch(runs, py)
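Benchmarking makes no sense until your code is fixed, so I'll use my replace_py() function as the baseline.
The replace_py() implementation, which I think does what you intended, is not pythonic and has many anti-patterns. It does seem to be correct, though.
Timing: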
np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_py(mat, 1, 3, 2)
1 loops, best of 3: 9.49 s per loop
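I don't think your problem can easily be rewritten to use Numpy and vectorization. Maybe a Numpy guru can do it, but I fear the code would become obscure or slow (or both). To quote one of the Numpy developers:
[...] when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython [...]
So I rewrote replace_py() and the functions it calls in Cython, using typed memoryviews: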
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline int find_first(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]!=value:
        index += 1
    return index

cdef inline int find_end(int[:] a, int index, int n, int value) nogil:
    while index<n and a[index]==value:
        index += 1
    return index

cdef inline void replace_run(int[:] a, int begin, int end, int threshold, int replace) nogil:
    cdef int i
    if end-begin+1 > threshold:
        # explicit loop instead of a slice assignment
        for i in range(begin, end):
            a[i] = replace

cdef inline void process_row(int[:] a, int value, int threshold, int replace) nogil:
    cdef int first, end, n
    first = 0
    n = a.shape[0]
    while first < n:
        if a[first]==value:
            end = find_end(a, first, n, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, n, value)

def replace_cy(np.ndarray[np.int32_t, ndim=2] a, int value, int length, int replace):
    cdef int[:, ::1] vmat
    cdef int i, n
    mat = a.copy()
    vmat = mat  # typed memoryview over the copy; writes go into mat
    n = vmat.shape[0]
    for i in range(n):
        process_row(vmat[i], value, length, replace)
    return mat
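It took a bit of massaging, and the code is more cluttered than the corresponding Python code above. But it was not too much work, and it was fairly straightforward.
Timing: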
np.random.seed(31)
shape = (100000, 10)
mat = np.asarray(a=np.random.binomial(1, p=0.5, size=shape), dtype=np.int32)
mat.reshape(shape)
%timeit replace_cy(mat, 1, 3, 2)
100 loops, best of 3: 8.16 ms per loop
A 1163x speedup!
I have received help on Github, and now the Numba version works as well; I just added @autojit to the pure Python code, except for a[begin:end] = replace; see the workaround in my discussion on Github.
import numpy as np
# autojit (and Python 2's xrange) were current when this was written;
# later Numba releases removed autojit in favor of jit
from numba import autojit

@autojit
def find_first(a, index, value):
    while index<a.size and a[index]!=value:
        index += 1
    return index

@autojit
def find_end(a, index, value):
    while index<a.size and a[index]==value:
        index += 1
    return index

@autojit
def replace_run(a, begin, end, threshold, replace):
    if end-begin+1 > threshold:
        # explicit loop as a workaround for a[begin:end] = replace
        for i in xrange(begin, end):
            a[i] = replace

@autojit
def process_row(a, value, threshold, replace):
    first = 0
    while first < a.size:
        if a[first]==value:
            end = find_end(a, first, value)
            replace_run(a, first, end, threshold, replace)
            first = end
        else:
            first = find_first(a, first, value)

@autojit
def replace_numba(a, value, length, replace):
    mat = a.copy()
    for row in mat:
        process_row(row, value, length, replace)
    return mat
Timing (same input as above, code omitted):
1 loops, best of 3: 86.5 ms per loop
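A 110x speedup compared to the pure Python code, essentially for free! The Numba version is still about 10x slower than Cython, most likely due to {{3}}, but I think it is amazing to get this kind of speedup basically for free, without messing up our Python code!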
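For readers on a current Numba, where autojit no longer exists, a roughly equivalent sketch using `numba.njit` might look like the following. It folds the helper functions into a single loop, which is my own simplification and not the code that was timed above; the name `replace_runs_njit` is made up, and the plain-Python fallback is only there so that the snippet still runs (slowly) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, run the same code uncompiled
    def njit(func):
        return func

@njit
def replace_runs_njit(a, value, length, replace):
    """Replace runs of `value` of length >= `length` in each row of a 2D array."""
    out = a.copy()
    rows, cols = out.shape
    for r in range(rows):
        start = 0
        while start < cols:
            if out[r, start] != value:
                start += 1
                continue
            # Scan to the end of the current run (end is exclusive)
            end = start
            while end < cols and out[r, end] == value:
                end += 1
            if end - start >= length:
                for c in range(start, end):
                    out[r, c] = replace
            start = end
    return out
```

The first call includes compilation time, so any timing should be taken on a second call.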
Answer 4 (score: 0)
This is slightly faster than the OP's version, but still hacky:
import itertools

def replace2(originalM) :
    m = originalM.copy()
    for v in m :
        idx = 0
        # groupby yields (value, group) per run; count each group to get its length
        for (key,n) in ( (key, sum(1 for _ in group)) for (key,group) in itertools.groupby(v) ) :
            if key and n>=3 :
                v[idx:idx+n] = 2
            idx += n
    return m
%%timeit
replace_runs(arr, 1, 3)
10000 loops, best of 3: 61.8 µs per loop
%%timeit
replace2(arr)
10000 loops, best of 3: 48 µs per loop
Answer 5 (score: 0)
toine's convolution approach is a good one as well. Based on {{3}}, you can use these answers to get what you need.
from itertools import groupby, repeat, chain

run_length = 3
new_value = 2

# Groups the element by successive repetition
grouped = [(k, sum(1 for _ in v)) for k, v in groupby(arr[0])]
# [(0, 2), (1, 4), (0, 2), (1, 2), (0, 1), (1, 3)]

# Only runs of 1 that are long enough get the new value (runs of 0 are kept)
output = list(chain(*[list(repeat(new_value if (k == 1 and v >= run_length) else k, v)) for k, v in grouped]))
# [0, 0, 2, 2, 2, 2, 0, 0, 1, 1, 0, 2, 2, 2]
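You have to do this for every row in arr. If you want it to be really efficient, you will have to adapt it to your needs (for example, by removing the list creations).
Using the example from Paul's answer that I linked, you can do something like: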
import numpy as np

new_value = 2
run_length = 3

# Difference between consecutive elements, padded with nonzero values
# so that both ends of the row count as run boundaries
diff = np.concatenate(([2], np.diff(arr[0]), [-1]))
# Indices where the value changes, i.e. where each run starts (plus the row end)
idx_diff = np.where(diff)[0]
# Get runs that are longer than 2 and whose value is 1
idx = np.where((np.diff(idx_diff) >= run_length) & arr[0][idx_diff[:-1]])[0]
# Set every such run to its new value
for i in idx:
    arr[0][idx_diff[i]:idx_diff[i+1]] = new_value
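This is just food for thought. With this approach it should be possible to process the whole matrix in one go and modify the array in place, which should be efficient. Sorry for the raw state of this idea; I hope it gives you some insight. A good speed-up hint is to remove the for loop.
That is, of course, if you are willing to sacrifice clarity for speed. In my opinion, Python is best suited to prototyping ideas quickly. If you have an algorithm that must be fast, write it in C (or with Cython) and use it from your Python program (via ctypes or CFFI).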
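One possible way to take the diff idea to the whole matrix without a Python loop is sketched below. It is an assumption-laden illustration rather than a tuned solution: `replace_runs_rle` is a hypothetical name, and it relies on `np.nonzero` returning indices in row-major order so that the k-th run start pairs with the k-th run end.

```python
import numpy as np

def replace_runs_rle(a, search=1, n=3, replace=2):
    """Replace runs of `search` of length >= n in each row of a 2D array."""
    out = a.copy()
    rows, cols = a.shape
    # Pad each row with a 0 on both sides so every run has a start and an end edge
    padded = np.zeros((rows, cols + 2), dtype=np.int8)
    padded[:, 1:-1] = (a == search)
    # edges is +1 at the first cell of a run and -1 just past its last cell
    edges = np.diff(padded, axis=1)
    r_start, c_start = np.nonzero(edges == 1)
    r_end, c_end = np.nonzero(edges == -1)
    # Row-major ordering pairs the k-th start with the k-th end
    long_run = (c_end - c_start) >= n
    # Put +1/-1 marks at the borders of long runs; a cumulative sum fills them in
    marks = np.zeros((rows, cols + 1), dtype=np.int8)
    marks[r_start[long_run], c_start[long_run]] = 1
    marks[r_end[long_run], c_end[long_run]] = -1
    mask = np.cumsum(marks, axis=1)[:, :cols] > 0
    out[mask] = replace
    return out
```

The cost here depends on the number of runs rather than on n, at the price of a few temporary arrays the size of the input.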