As a simple example, consider the numpy array arr defined as follows:
import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
where arr looks like this in console output:
array([[ 5., nan, nan, 7., 2.],
[ 3., nan, 1., 8., nan],
[ 4., 9., 6., nan, nan]])
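I would now like to row-wise "forward-fill" the nan values in the array arr. By that I mean replacing each nan value with the nearest valid value to its left. The desired result would look like this: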
array([[ 5., 5., 5., 7., 2.],
[ 3., 3., 1., 8., 8.],
[ 4., 9., 6., 6., 6.]])
I've tried using for-loops:
for row_idx in range(arr.shape[0]):
    # Start at column 1 so that col_idx - 1 never wraps around to the last column.
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx][col_idx]):
            arr[row_idx][col_idx] = arr[row_idx][col_idx - 1]
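I've also tried using a pandas DataFrame as an intermediate step (since pandas DataFrames have a very neat built-in method for forward-filling):
import pandas as pd
df = pd.DataFrame(arr)
# DataFrame.as_matrix() was removed in pandas 1.0; ffill() plus to_numpy() does the same job.
arr = df.ffill(axis=1).to_numpy()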
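Both of the above strategies produce the desired result, but I keep wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?
Is there another, more efficient way to "forward-fill" nan values in numpy arrays? (e.g. by using numpy vectorized operations)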
Update: I've tried to time all of the solutions posted so far. This was my setup script:
import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    out = df.ffill(axis=1).to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out
followed by this console input:
%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())
resulting in this console output:
1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
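The timings are only comparable if the candidates return the same result, so a quick agreement check can be run first. The sketch below is not part of the original benchmark: it repeats two of the functions from the setup script so that it is self-contained, and the seed and the names `sample`, `expected` and `result` are arbitrary choices made for illustration.

```python
import numpy as np

def loops_fill(arr):
    # Reference implementation: plain Python loops, as in the setup script.
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def numpy_fill(arr):
    # Vectorized candidate, as in the setup script.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

# Assumption: a fixed seed, so the check is reproducible.
rng = np.random.default_rng(0)
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
sample = rng.choice(choices, size=(1000, 10))

expected = loops_fill(sample)
result = numpy_fill(sample)

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(result, expected)
```

Both functions leave a NaN in place when a row starts with NaN, which is why they agree on such rows as well.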
Answer 0: (score: 29)
Here's one approach:
mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]
If you don't want to create another array and would rather fill the NaNs in arr itself, replace the last step with this:
arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
Sample input and output:
In [179]: arr
Out[179]:
array([[ 5., nan, nan, 7., 2., 6., 5.],
[ 3., nan, 1., 8., nan, 5., nan],
[ 4., 9., 6., nan, nan, nan, 7.]])
In [180]: out
Out[180]:
array([[ 5., 5., 5., 7., 2., 6., 5.],
[ 3., 3., 1., 8., 8., 5., 5.],
[ 4., 9., 6., 6., 6., 6., 7.]])
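To see why this works, it may help to look at the intermediate index array. The walk-through below is only an illustration: it applies the same four steps to the 3x5 array from the question, with the intermediate values written out as comments.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Valid cells keep their own column index, NaN cells get the placeholder 0.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
# idx is now:
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# A running maximum along each row replaces every placeholder with the
# column index of the most recent valid cell to its left.
np.maximum.accumulate(idx, axis=1, out=idx)
# idx is now:
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Fancy indexing then picks, for each cell, the value at that column.
out = arr[np.arange(idx.shape[0])[:, None], idx]
```

Because the placeholder is 0, a row that starts with NaN keeps pointing at column 0 until the first valid value appears, so leading NaNs are left unfilled.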
Answer 1: (score: 4)
Use Numba. This should give a significant speedup:
import numba
@numba.jit
def loops_fill(arr):
    ...
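The function body is elided in the answer. As a rough sketch of what a complete jitted loop could look like, the version below carries the last valid value along each row instead of reading the left neighbour. The name `ffill_loop` is hypothetical, and the fallback decorator is an assumption added only so that the example still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, fall back to running the plain Python loop.
    def njit(func):
        return func

@njit
def ffill_loop(arr):
    out = arr.copy()
    n_rows, n_cols = out.shape
    for i in range(n_rows):
        last = np.nan  # no valid value seen yet in this row
        for j in range(n_cols):
            if np.isnan(out[i, j]):
                out[i, j] = last
            else:
                last = out[i, j]
    return out

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
filled = ffill_loop(arr)
```

With Numba available, the first call includes compilation time, so it would make sense to call the function once before timing it.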
Answer 2: (score: 1)
For those interested in the problem of leading np.nan values that remain after forward-filling, the following works:
mask = np.isnan(arr)
first_non_zero_idx = (~mask != 0).argmax(axis=1)  # Get indices of first non-zero values
arr = [np.hstack([
           [arr[i, first_nonzero]] * first_nonzero,
           arr[i, first_nonzero:]])
       for i, first_nonzero in enumerate(first_non_zero_idx)]
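The same idea can also be written without a Python-level loop. The sketch below is one possible vectorized variant, not the answer's own code: it forward-fills with the index trick from the accepted answer and then overwrites the leading NaNs of each row with that row's first valid value. The function name `ffill_then_fill_leading` is made up for this example.

```python
import numpy as np

def ffill_then_fill_leading(arr):
    mask = np.isnan(arr)
    rows = np.arange(arr.shape[0])
    cols = np.arange(arr.shape[1])

    # Forward-fill: index of the most recent valid column for each cell.
    idx = np.where(~mask, cols, 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[rows[:, None], idx]

    # Column of the first valid value in each row (0 if the row is all NaN).
    first_valid = (~mask).argmax(axis=1)
    # True for the cells that sit before the first valid value of their row.
    leading = cols < first_valid[:, None]
    return np.where(leading, arr[rows, first_valid][:, None], out)

example = np.array([[np.nan, np.nan, 3, np.nan, 5],
                    [1, np.nan, np.nan, 4, np.nan]])
filled = ffill_then_fill_leading(example)
```

A row that contains only NaNs has no valid value to copy, so it stays all NaN.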
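Answer 3: (score: 1)
For those who came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is that you have to accumulate over the reversed array, using the minimum instead of the maximum.
Here is the code: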
import numpy as np

# As provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

# My modification to do a backward-fill
def bfill(arr):
    mask = np.isnan(arr)
    # NaN cells get the last column index as placeholder, so the index always
    # stays in bounds and trailing NaNs are left unfilled.
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out

# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)
print('\nffill')
print(ffill(arr))
print('\nbfill')
print(bfill(arr))
Output:
Array:
[[ 5. nan nan 7. 2.]
[ 3. nan 1. 8. nan]
[ 4. 9. 6. nan nan]]
ffill
[[5. 5. 5. 7. 2.]
[3. 3. 1. 8. 8.]
[4. 9. 6. 6. 6.]]
bfill
[[ 5. 7. 7. 7. 2.]
[ 3. 1. 1. 8. nan]
[ 4. 9. 6. nan nan]]
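An equivalent way to think about the backward-fill, sketched here under the assumption that a forward-fill function is already available, is to reverse the columns, forward-fill, and reverse back. The name `bfill_via_flip` is hypothetical, and `ffill` is repeated so that the snippet is self-contained.

```python
import numpy as np

def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

def bfill_via_flip(arr):
    # Reverse the columns, forward-fill, then restore the original order.
    return ffill(arr[:, ::-1])[:, ::-1]

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
expected = np.array([[5, 7, 7, 7, 2],
                     [3, 1, 1, 8, np.nan],
                     [4, 9, 6, np.nan, np.nan]])

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(bfill_via_flip(arr), expected)
```

The result is a reversed view of a freshly allocated array, so the input is not modified.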
Answer 4: (score: 1)
I liked Divakar's pure-numpy answer. Here is a generalized function for n-dimensional arrays:
def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim == i else np.newaxis
                               for dim in range(len(arr.shape))])]
           for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]
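AFAIK pandas can only work with two dimensions, even though it has MultiIndex to compensate. The only way to accomplish this there would be to flatten the DataFrame, unstack the desired level, restack, and finally reshape back to the original shape. All of this unstacking, restacking and reshaping, together with the sorting pandas does along the way, is just unnecessary overhead for achieving the same result.
Testing: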
def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffull')
print(np_ffill(ra, 1))
raise SystemExit
Output:
arr
[[[ 3. nan 4. 1. 4. 2. 2. 3.]
[ 2. nan 1. 3. nan 4. 4. 3.]
[ 3. 2. nan 4. nan nan 3. 4.]
[ 2. 2. 2. nan 1. 1. nan 2.]]
[[ 2. 3. 2. nan 3. 3. 3. 3.]
[ 3. 3. 1. 4. 1. 4. 1. nan]
[ 4. 2. nan 4. 4. 3. nan 4.]
[ 2. 4. 2. 1. 4. 1. 3. nan]]]
ffull
[[[ 3. nan 4. 1. 4. 2. 2. 3.]
[ 2. nan 1. 3. 4. 4. 4. 3.]
[ 3. 2. 1. 4. 4. 4. 3. 4.]
[ 2. 2. 2. 4. 1. 1. 3. 2.]]
[[ 2. 3. 2. nan 3. 3. 3. 3.]
[ 3. 3. 1. 4. 1. 4. 1. 3.]
[ 4. 2. 1. 4. 4. 3. 1. 4.]
[ 2. 4. 2. 1. 4. 1. 3. 4.]]]
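On NumPy 1.15 or later, the final gather step can probably be expressed more compactly with `np.take_along_axis`. The sketch below is an alternative formulation, not the answer's code, and the name `np_ffill_take` is invented for this example.

```python
import numpy as np

def np_ffill_take(arr, axis):
    n = arr.shape[axis]
    # Positions 0..n-1 laid out along `axis`, broadcastable against arr.
    shape = [1] * arr.ndim
    shape[axis] = n
    positions = np.arange(n).reshape(shape)

    # Valid cells keep their position, NaN cells get 0; a running maximum
    # then yields the position of the most recent valid cell.
    idx = np.where(~np.isnan(arr), positions, 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)

    # Gather along `axis` using the per-cell source positions.
    return np.take_along_axis(arr, idx, axis=axis)

nan = np.nan
cube = np.array([[[1, nan], [nan, 2], [nan, nan]],
                 [[nan, 3], [4, nan], [nan, nan]]])
expected = np.array([[[1, nan], [1, 2], [1, 2]],
                     [[nan, 3], [4, 3], [4, 3]]])

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(np_ffill_take(cube, 1), expected)
```

As with the 2-D version, NaNs that come before the first valid value along the chosen axis remain NaN.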
Answer 5: (score: 1)
I liked Divakar's answer, but it does not work for the edge case where a row starts with np.nan, such as arr below:
arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
[ np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])
The output using Divakar's code would be:
[[ 9. 9. 4. 4. 6. 6. 7. 2. 3. 3.]
[nan 4. 5. 6. 5. 3. 2. 1. 1. 10.]]
Divakar's code can be simplified a bit, and the simplified version solves this issue at the same time:
arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]
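It may be worth checking this one-liner on a copy before relying on it. The right-hand side reads each NaN's immediate left neighbour from the array as it was before any filling, so under NumPy's normal indexing rules a NaN in column 0 reads column -1 (the last column), and the second NaN in a run of consecutive NaNs reads another NaN. The check below uses the same indexing, split into two statements; `filled`, `run` and `run_filled` are names introduced only for this illustration.

```python
import numpy as np

arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
                [np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])

# Same operation as the one-liner, applied to a copy.
filled = arr.copy()
rows, cols = np.nonzero(np.isnan(arr))
filled[rows, cols] = arr[rows, cols - 1]

# The leading NaN in the second row was filled from column -1, the last column.
print(filled[1, 0])  # 10.0

# Two consecutive NaNs: the second one copies the (still NaN) first one.
run = np.array([[1, np.nan, np.nan, 4]])
run_filled = run.copy()
r, c = np.nonzero(np.isnan(run))
run_filled[r, c] = run[r, c - 1]
print(run_filled)  # second NaN is still NaN
```

Whether wrapping around to the last column is acceptable depends on the data. If it is not, the leading NaNs need to be handled as a separate step.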
Answer 6: (score: -1)