I have a specific performance problem here. I am working with meteorological forecast time series, which I compile into a 2D numpy array such that dim0 is the time at which a forecast run starts and dim1 is the forecast horizon. Now, I would like dim0 to have hourly steps, but some sources only produce a forecast every N hours. As an example, say N = 3 and the time step along dim1 is M = 1 hour. Then I get something like:
12:00   11.2  12.2  14.0  15.0  11.3  12.0
13:00    nan   nan   nan   nan   nan   nan
14:00    nan   nan   nan   nan   nan   nan
15:00   14.7  11.5  12.2  13.0  14.3  15.1
But of course there is information for 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:
12:00   11.2  12.2  14.0  15.0  11.3  12.0
13:00   12.2  14.0  15.0  11.3  12.0   nan
14:00   14.0  15.0  11.3  12.0   nan   nan
15:00   14.7  11.5  12.2  13.0  14.3  15.1
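For concreteness, a minimal sketch of how the toy example above might be set up (the variable name is illustrative, and the timestamps are kept outside the float array):

import numpy as np

nan = np.nan
before = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                   [ nan,  nan,  nan,  nan,  nan,  nan],
                   [ nan,  nan,  nan,  nan,  nan,  nan],
                   [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])
# Desired result: each NaN row becomes the previous forecast run shifted one column
# to the left, with NaNs appearing at the far end where the older run has no data.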
What is the fastest way to get there, assuming dim0 is of order 1e4 and dim1 of order 1e2? Right now I go row by row, but that is slow:
nRows, nCols = dat.shape
if N >= M:
    assert N % M == 0  # must have whole numbers
    for i in range(1, nRows):
        k = np.where(np.isnan(dat[i, :]))[0]
        k = k[k < nCols - N]   # do not overstep
        dat[i, k] = dat[i - 1, k + N]
I am sure there must be a more elegant way to do this? Any hints would be much appreciated.
Answer 0 (score: 5)
Behold, the power of boolean indexing!!!
def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]   # NaN cells that could be filled
        read_mask = nan_mask[:-1, 1:]    # their sources: one row up, one column right
        write_mask &= ~read_mask         # only fill where the source is not NaN itself
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]
I think the naming is self-explanatory of what is going on. Getting the slicing right is a pain, but it seems to be working:
In [214]: shift_nans(test_data)
Out[214]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
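As a small aside (not from the original answer), the way the two slices pair up can be checked on a toy index array:

import numpy as np

a = np.arange(12).reshape(3, 4)
print(a[1:, :-1])   # the "write" cells: every cell with a row above and a column to its right
print(a[:-1, 1:])   # the matching "read" cells, one row up and one column to the right
# Element [i, j] of the first view is cell (i+1, j) of the full array and is paired with
# cell (i, j+1), i.e. the previous forecast run one timestep later.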
Timings:
tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan
tmp1 = tmp.copy()

import timeit

t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)',  # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)
In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
Answer 1 (score: 2)
Slice your data with a = yourdata[:, 1:].
def shift_time(dat):
    # Find number of required iterations
    check = np.where(~np.isnan(dat[:, 0]))[0]
    maxiters = np.max(np.diff(check)) - 1
    # No sense in iterations where it just updates nans
    cols = dat.shape[1]
    if cols < maxiters:
        maxiters = cols - 1
    for iters in range(maxiters):
        # Find nans (row indices first, then column indices)
        row_loc, col_loc = np.where(np.isnan(dat[:, :-1]))
        dat[(row_loc, col_loc)] = dat[(row_loc - 1, col_loc + 1)]
a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.]])

shift_time(a)
print a
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15. ]]
To use your data as-is (the function could also be changed slightly to take your array directly, but this seems like a clear way to show it):

shift_time(yourdata[:,1:])  # Updates in place, no need to return anything.
Using tiago's test:
import time

tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan

t = time.time()
shift_time(tmp)
print time.time() - t
0.364198923111 (seconds)
If you are really clever, you should be able to get away with a single np.where.
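Not the single-np.where version hinted at above, but as an illustration of where that idea can lead, here is a sketch (mine, not from the answer) that fills each missing row in one slice copy, assuming missing forecast runs show up as rows that are entirely NaN:

import numpy as np

def fill_from_last_run(dat):
    nrows, ncols = dat.shape
    valid = ~np.isnan(dat[:, 0])                 # rows holding an actual forecast run
    # index of the most recent valid row at or before each row (-1 if none yet)
    last = np.maximum.accumulate(np.where(valid, np.arange(nrows), -1))
    offset = np.arange(nrows) - last             # how many columns to shift left
    for i in np.flatnonzero((last >= 0) & (offset > 0) & (offset < ncols)):
        dat[i, :ncols - offset[i]] = dat[last[i], offset[i]:]
    return dat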
Answer 2 (score: 1)
This seems to do the trick:
import numpy as np

def shift_time(dat):
    NX, NY = dat.shape
    for i in range(NY):
        # Positions of all NaNs, and their sources one row up / one column right
        x, y = np.where(np.isnan(dat))
        xr = x - 1
        yr = y + 1
        idx = (xr >= 0) & (yr < NY)   # stay inside the array
        dat[x[idx], y[idx]] = dat[xr[idx], yr[idx]]
    return
Now with some test data:
In [1]: test_data = array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ nan, nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ nan, nan, nan, nan, nan, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
In [2]: shift_time(test_data)
In [3]: print test_data
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
Testing with a (1e4, 1e2) array:
In [1]: tmp = np.random.uniform(-10, 20, (10000, 100))
In [2]: nan_idx = np.random.randint(30, 10000 - 1, 10000)
In [3]: tmp[nan_idx] = nan
In [4]: time shift_time(tmp)
CPU times: user 1.53 s, sys: 0.06 s, total: 1.59 s
Wall time: 1.59 s
Answer 3 (score: 0)
Each iteration of this pad, roll, roll combination basically gives you what you are after:
import numpy as np
from numpy import nan as nan

# Startup array
A = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

def pad_nan(v, pad_width, iaxis, kwargs):
    # Custom np.pad function: fill the padded border with NaN
    v[:pad_width[0]] = nan
    v[-pad_width[1]:] = nan
    return v

def roll_data(A):
    idx = np.isnan(A)
    # Pad with a NaN border, roll down one row and left one column, trim the border,
    # and copy the result into the NaN positions of A
    A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
    return A

print A
print roll_data(A)
print roll_data(A)
The output is:
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ nan nan nan nan nan nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
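A side note on the np.pad call above (my addition, not part of the answer): the custom pad_nan function just surrounds the array with a one-cell NaN border, which in reasonably recent numpy can also be written with the constant mode:

import numpy as np

B = np.array([[1.0, 2.0], [3.0, 4.0]])
padded = np.pad(B, 1, mode='constant', constant_values=np.nan)
print(padded)
# Rolling the padded array down one row and left one column and then trimming the
# border lines each cell up with its "previous run, one timestep later" neighbour.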
Everything is pure numpy, so each iteration should be quite fast. However, I am not sure about the cost of creating the padded array and of running multiple iterations; if you try it, let me know the results!