What is the best way to find the maximum number of consecutive NaNs in a numpy array?
Examples:
from numpy import nan
Input 1: [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
Output 1: 3
Input 2: [nan, nan, 2, 1, 1, nan, nan, nan, nan, 0.101, nan, 0.16]
Output 2: 4
Answer 0 (score: 5)
Here's one approach:
import numpy as np

def max_repeatedNaNs(a):
    # Mask of NaNs, padded with False so every run has a start and a stop edge
    mask = np.concatenate(([False], np.isnan(a), [False]))
    if not mask.any():
        return 0
    else:
        # Count of NaNs in each NaN group. Then, get max count as o/p.
        c = np.flatnonzero(mask[1:] < mask[:-1]) - \
            np.flatnonzero(mask[1:] > mask[:-1])
        return c.max()
Here's an improved version:
def max_repeatedNaNs_v2(a):
    mask = np.concatenate(([False], np.isnan(a), [False]))
    if not mask.any():
        return 0
    else:
        # Indices where the mask flips; they alternate start, stop, start, stop, ...
        idx = np.nonzero(mask[1:] != mask[:-1])[0]
        return (idx[1::2] - idx[::2]).max()
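To see why subtracting the edge positions gives the run lengths, it can help to look at the intermediate arrays for the second input from the question. This is only an illustrative sketch; the helper name `nan_run_lengths` is made up here and is not part of the answer.

```python
import numpy as np

def nan_run_lengths(a):
    """Return the length of every NaN run in a 1-D float array (hypothetical helper)."""
    # Pad with False on both sides so runs touching either end still have two edges.
    mask = np.concatenate(([False], np.isnan(a), [False]))
    # Positions where the padded mask flips; they alternate start, stop, start, stop, ...
    edges = np.flatnonzero(mask[1:] != mask[:-1])
    starts, stops = edges[::2], edges[1::2]
    return stops - starts

a = np.array([np.nan, np.nan, 2, 1, 1, np.nan, np.nan, np.nan, np.nan, 0.101, np.nan, 0.16])
lengths = nan_run_lengths(a)
print(lengths.tolist())  # [2, 4, 1]
print(lengths.max())     # 4
```

For an array without any NaN the result is empty and calling `.max()` on it would raise, which is presumably why both versions above check `mask.any()` first.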
Benchmarking
In [77]: a = np.random.rand(10000)
In [78]: a[np.random.choice(range(len(a)),size=1000,replace=0)] = np.nan
In [79]: %timeit contiguous_NaN(a) #@pltrdy's solution
100 loops, best of 3: 15.8 ms per loop
In [80]: %timeit max_repeatedNaNs(a)
10000 loops, best of 3: 103 µs per loop
In [81]: %timeit max_repeatedNaNs_v2(a)
10000 loops, best of 3: 86.4 µs per loop
Answer 1 (score: 4)
I don't know whether you have numba, but it is very handy (and fast) for this kind of problem:
import numba as nb
import math

@nb.njit  # also works without, but then it's several orders of magnitude slower
def max_consecutive_nan(arr):
    max_ = 0
    current = 0
    idx = 0
    while idx < arr.size:
        # Count the NaN run starting at idx (if any)
        while idx < arr.size and math.isnan(arr[idx]):
            current += 1
            idx += 1
        if current > max_:
            max_ = current
        current = 0
        idx += 1
    return max_
For your examples:
>>> import numpy as np
>>> from numpy import nan
>>> max_consecutive_nan(np.array([nan, nan, 2, 1, 1, nan, nan, nan, nan, 0.101, nan, 0.16]))
4
>>> max_consecutive_nan(np.array([nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]))
3
>>> max_consecutive_nan(np.array([0.16, 0.16, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]))
22
Using the benchmark proposed by @Divakar, sorted by performance (the full benchmark code is available in this gist):
arr = np.random.rand(10000)
arr[np.random.choice(range(len(arr)),size=1000,replace=0)] = np.nan
%timeit mine(arr) # 10000 loops, best of 3: 67.7 µs per loop
%timeit Divakar_v2(arr) # 1000 loops, best of 3: 196 µs per loop
%timeit Divakar(arr) # 1000 loops, best of 3: 252 µs per loop
%timeit Tagc(arr) # 100 loops, best of 3: 6.92 ms per loop
%timeit Kasramvd(arr) # 10 loops, best of 3: 38.2 ms per loop
%timeit pltrdy(arr) # 10 loops, best of 3: 70.9 ms per loop
Answer 2 (score: 1)
I posted another answer based on itertools, but I believe this one is better:
from itertools import groupby
from numpy import nan

def longest_nan_run(sequence):
    return max((sum(1 for _ in group) for key, group in groupby(sequence) if key is nan), default=0)

if __name__ == '__main__':
    array1 = [nan, nan, nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, nan, 0.16]
    array2 = [nan, nan, 2, 1, 1, nan, nan, nan, nan, 0.101, nan, 0.16]
    print(longest_nan_run(array1))  # 3
    print(longest_nan_run(array2))  # 4
    print(longest_nan_run([]))  # 0
    print(longest_nan_run([1, 2]))  # 0
Edit: this now handles the case where there are no nan values (thanks to MSeifert for pointing that out).
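Note that `key is nan` relies on every element being the very same `nan` object. That holds for the Python lists above, but generally not for values read out of a NumPy array or created with `float('nan')`. A possible variant, sketched here under the assumption that all elements are real numbers, groups by `math.isnan` instead; the name `longest_nan_run_isnan` is hypothetical.

```python
import math
from itertools import groupby

import numpy as np

def longest_nan_run_isnan(sequence):
    """Length of the longest NaN run; tests values with math.isnan, not identity."""
    return max(
        (sum(1 for _ in group)
         for is_nan, group in groupby(sequence, key=math.isnan)
         if is_nan),
        default=0,
    )

arr = np.array([np.nan, np.nan, 2, 1, 1, np.nan, np.nan, np.nan, np.nan, 0.101, np.nan, 0.16])
print(longest_nan_run_isnan(arr))  # 4
print(longest_nan_run_isnan([float('nan'), 1.0, float('nan'), float('nan')]))  # 2
print(longest_nan_run_isnan([]))  # 0
```

Grouping on the boolean result of `math.isnan` also avoids depending on how `groupby` compares NaN keys with each other.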
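Answer 3 (score: 1)
Performance can be improved, especially when there are long NaN runs. In those cases there is no need to test every value.
Using @MSeifert's approach and notation: as soon as a non-NaN value (a "hole") is found inside a block of length max_, that block cannot contain a longer run, so the array can be scanned in steps of max_ rather than one element at a time: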
import math
import numba as nb

@nb.njit
def max_consecutive_nan2(arr):
    max_ = 0
    idx = 0
    while idx < arr.size:
        while idx < arr.size and math.isnan(arr[idx]):  # extension step
            max_ += 1
            idx += 1
        while idx < arr.size - max_:
            idx2 = idx + max_
            while idx2 > idx and math.isnan(arr[idx2]):
                idx2 -= 1
            if idx2 == idx:  # record reached.
                idx = idx + max_ + 1
                break  # go back to the extension step
            idx = idx2  # skip unnecessary tests
        else:
            return max_
    return max_  # case where the record is at the end.
Results:
arr = np.random.rand(10000)
arr[np.random.choice(range(len(arr)),size=4000,replace=0)] = np.nan
In [25]: max_consecutive_nan(arr)
Out[25]: 14
In [26]: max_consecutive_nan2(arr)
Out[26]: 14
Performance:
In [27]: %timeit max_consecutive_nan2(arr)
100000 loops, best of 3: 3.29 µs per loop
In [28]: %timeit max_consecutive_nan(arr) # MSeifert
10000 loops, best of 3: 68.5 µs per loop
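To see where the savings come from, one can count how many elements are actually tested. The sketch below is a plain-Python copy of the skipping scan (no numba, so it is slow) with a counter added. The function name `skipping_scan_counted` and the test array are assumptions made for illustration, not part of the answer.

```python
import math
import numpy as np

def skipping_scan_counted(arr):
    """Plain-Python copy of the skipping scan that also counts NaN tests."""
    tests = 0

    def is_nan(i):
        nonlocal tests
        tests += 1
        return math.isnan(arr[i])

    n = len(arr)
    max_ = 0
    idx = 0
    while idx < n:
        # Extend the record while the run continues.
        while idx < n and is_nan(idx):
            max_ += 1
            idx += 1
        while idx < n - max_:
            # Look max_ positions ahead and walk back over NaNs.
            idx2 = idx + max_
            while idx2 > idx and is_nan(idx2):
                idx2 -= 1
            if idx2 == idx:   # a run of length max_ follows idx
                idx += max_ + 1
                break         # go back and try to extend the record
            idx = idx2        # non-NaN found: jump straight to it
        else:
            break             # too few elements left to beat the record
    return max_, tests

arr = np.ones(1000)
arr[100:400] = np.nan         # one long run of 300 NaNs
best, tests = skipping_scan_counted(arr)
print(best, tests, arr.size)
```

On this input the scan performs fewer tests than there are elements, because once the run of 300 has been found the remaining values are checked only every 300 positions. How much is saved depends on where the long runs sit in the data.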
Answer 4 (score: 0)
Here's my solution.
The time complexity is O(n), where n = len(arr), and the space complexity is O(1).
Edit: keep in mind that the point of your code is:
Answer 5 (score: 0)
Another approach that is easy to read and understand is to convert the NaN mask to a string and then use str.split:
from numpy import isnan, nan

array2 = [nan, nan, 2, 1, 1, nan, nan, nan, nan, 0.101, nan, 0.16]
# Each boolean becomes one byte: '\x01' for NaN, '\x00' otherwise
thestring = isnan(array2).tobytes().decode()
# '\x01\x01\x00\x00\x00\x01\x01\x01\x01\x00\x01\x00'
m = max(len(c) for c in thestring.split('\x00'))
# 4
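Answer 6 (score: 0)
This can be done very efficiently in NumPy without using any loops.
If we call the sequence x, then we can find the maximum number of consecutive nans with: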
np.max(np.diff(np.concatenate(([-1], np.where(~np.isnan(x))[0], [len(x)]))) - 1)
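The one-liner is easier to follow when split into steps. The sketch below does that for the second input from the question, using `~` to invert the boolean mask (recent NumPy versions reject unary minus on boolean arrays). The intermediate names `valid`, `bounds` and `gaps` are chosen here for illustration.

```python
import numpy as np

x = np.array([np.nan, np.nan, 2, 1, 1, np.nan, np.nan, np.nan, np.nan, 0.101, np.nan, 0.16])

# Indices of the values that are NOT NaN.
valid = np.where(~np.isnan(x))[0]
# Add sentinels just before the first and just after the last position.
bounds = np.concatenate(([-1], valid, [len(x)]))
# Number of NaNs strictly between each pair of neighbouring non-NaN positions.
gaps = np.diff(bounds) - 1

print(valid.tolist())  # [2, 3, 4, 9, 11]
print(gaps.tolist())   # [2, 0, 0, 4, 1, 0]
print(gaps.max())      # 4
```

Thanks to the two sentinels, the expression gives 0 for an array with no NaN and len(x) for an array that is all NaN, so no special case is needed.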