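I am trying to find the nearest index, and the range it falls in, for several values in one column. A minimal working Python example is below. My real data is larger than this example, with more than 3000 rows. The code below works as I expect, but it is slow: it takes about 50-60 seconds.
How can I reduce this time? Is there another way to handle this case?
Note: my model X and Y values are in the 'list' and 'vals' columns,
and the observed X values are in the 'obser' column.
My attempt, using a FindNearest function: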
import numpy as np
import pandas as pd
from datetime import datetime as dt

def FindNearest(Table, Value):
    idx = Table['list'].sub(Value).abs().idxmin()  # index of the nearest model X
    row_nrst = Table.loc[idx]                      # all values at the nearest row
    # model X is in descending order: step back (-1) if Value is larger than
    # the nearest X, forward (+1) if it is smaller, to reach the second-nearest row
    updwn = -1 if Value > row_nrst['list'] else 1
    # model values that Value lies between
    lst1, lval1 = row_nrst[['list', 'vals']]
    lst2, lval2 = Table.loc[idx+updwn, ['list', 'vals']]
    # calculate the observed Y value by linear interpolation
    rvals = lval1 + (lval2-lval1)*(Value-lst1)/(lst2-lst1)
    return pd.Series([idx, rvals])

start = dt.now()
aa = np.matrix([
    [ 15, 14, 13, 12, 11, 10, 9, 8],                           # model X vals
    [ 100.5, 94.5, 88.5, 66.5, 74.5, 91.5, 105.5, 120.5],      # model Y vals
    [12.3, 14.6, 8.7, 13.5, 14.2, 9.4, 11.3, 11.5],            # observed X vals
    [-1, -1, -1, -1, -1, -1, -1, -1],                          # index of model X vals
    [-1, -1, -1, -1, -1, -1, -1, -1]                           # calculated observed Y vals
]).transpose()
tbl = pd.DataFrame(aa, columns=['list', 'vals', 'obser', 'ids', 'obsval'])
# the lookup is performed with the **apply** function of the Pandas library
tbl[['ids', 'obsval']] = tbl.apply(lambda x: FindNearest(tbl, x['obser']), axis=1)
elapsed = dt.now() - start
print(tbl)
print('Elapsed time :%2.3Fs\n' % (elapsed.total_seconds()))
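Answer 0 (score: 0)
Here is an approach that needs a single pass over the sorted data to build a column holding the closest index.
This is the data setup from your original code.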
import pandas as pd
import numpy as np
aa = np.matrix([
[ 15, 14, 13, 12, 11, 10, 9, 8], # model X vals
[ 100.5, 94.5, 88.5, 66.5, 74.5, 91.5, 105.5, 120.5], # model Y vals
[12.3, 14.6, 8.7, 13.5, 14.2, 9.4, 11.3, 11.5], # observed X vals
[-1, -1, -1, -1, -1, -1, -1, -1], # index of model X vals
[-1, -1, -1, -1, -1, -1, -1, -1] # calculated observed Y vals
]).transpose()
tbl = pd.DataFrame(aa, columns=['list', 'vals', 'obser', 'ids', 'obsval'])
Then I add two sentinel rows, which serve as the first and last rows once the data is sorted.
bb = np.matrix([
[ -999, 999], # model X vals
[ 0.0, 0.0], # model Y vals
[float ("-inf"), float("inf")], # observed X vals
[-1, -1], # index of model X vals
[-1, -1] # calculated observed Y vals
]).transpose()
sentinel = pd.DataFrame(bb, columns=['list', 'vals', 'obser', 'ids', 'obsval'])
sentinel.head()
Then I create a data frame that combines your data with the sentinel rows and sort it by 'obser'.
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([tbl, sentinel], ignore_index=False)
df.sort_values('obser', inplace=True)
print(df.head(10))
Here is the result:
    list   vals  obser  ids  obsval
0 -999.0    0.0   -inf -1.0    -1.0
2   13.0   88.5    8.7 -1.0    -1.0
5   10.0   91.5    9.4 -1.0    -1.0
6    9.0  105.5   11.3 -1.0    -1.0
7    8.0  120.5   11.5 -1.0    -1.0
0   15.0  100.5   12.3 -1.0    -1.0
3   12.0   66.5   13.5 -1.0    -1.0
4   11.0   74.5   14.2 -1.0    -1.0
1   14.0   94.5   14.6 -1.0    -1.0
1  999.0    0.0    inf -1.0    -1.0
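Note that the index labels in this output are the original ones: sorting reorders the rows but keeps their labels, and because the sentinel frame was combined without renumbering, labels 0 and 1 each appear twice. A minimal sketch of that behavior, using a toy example (the names `left`, `right` and `combined` are illustrative and not part of the code above):

```python
import pandas as pd

left = pd.DataFrame({'obser': [12.3, 8.7]})       # labels 0, 1
right = pd.DataFrame({'obser': [float('-inf')]})  # label 0 again

# Combining without ignore_index=True keeps each frame's own labels,
# and sort_values moves the labels along with their rows.
combined = pd.concat([left, right]).sort_values('obser')

labels = list(combined.index)
print(labels)
```

This matters for the loop that follows, because `itertuples()` exposes those labels as `row.Index`. If unique labels are needed, passing `ignore_index=True` when combining is one option, at the cost of losing the original row numbers.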
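Finally, keep track of three rows: prev, curr and next. Compare their 'obser' values and pick whichever neighbor (prev or next) is closer to curr.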
closest_idx = []
# seed prev and curr with the first two rows (the -inf sentinel and the smallest obser)
for i, row in enumerate(df[:2].itertuples()):
    if i == 0:
        print('setting prev')
        prev = row
    if i == 1:
        print('setting curr')
        curr = row
# slide the prev/curr/next window over the remaining rows
# (note: the name `next` shadows the Python built-in within this script)
for i, next in enumerate(df[2:].itertuples()):
    if (next.obser - curr.obser) > (curr.obser - prev.obser):
        closer_idx = prev.Index
    else:
        closer_idx = next.Index
    print(f'for row {i}, using {closer_idx}')
    prev = curr
    curr = next
    closest_idx.append(closer_idx)
print(f'{closest_idx}')
This prints the index labels of the closest rows, in sorted 'obser' order:
[5, 2, 7, 6, 7, 4, 1, 4]
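The loop above finds, for each observed value, its nearest neighbor among the other observed values. If the goal is instead what the question's FindNearest computes (the nearest model X index plus a linearly interpolated Y for each observed X), a vectorized sketch is possible. This swaps in NumPy broadcasting and `np.interp` in place of `apply`; it is not the method from the answer above. It assumes the frame has a default 0..n-1 index, so positions equal labels, and that observed values lie within the model X range:

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame({
    'list':  [15, 14, 13, 12, 11, 10, 9, 8],                        # model X
    'vals':  [100.5, 94.5, 88.5, 66.5, 74.5, 91.5, 105.5, 120.5],   # model Y
    'obser': [12.3, 14.6, 8.7, 13.5, 14.2, 9.4, 11.3, 11.5],        # observed X
})

model_x = tbl['list'].to_numpy(dtype=float)
model_y = tbl['vals'].to_numpy(dtype=float)
obs_x = tbl['obser'].to_numpy(dtype=float)

# Nearest model X for every observed X at once: build an
# (n_obs, n_model) distance matrix and take the argmin of each row.
# Assumption: default RangeIndex, so the position is also the label.
tbl['ids'] = np.abs(obs_x[:, None] - model_x[None, :]).argmin(axis=1)

# np.interp requires increasing x coordinates, so sort the model points first.
order = np.argsort(model_x)
tbl['obsval'] = np.interp(obs_x, model_x[order], model_y[order])

print(tbl)
```

Two differences from FindNearest are worth keeping in mind: `np.interp` clamps to the end values for observations outside the model range rather than extrapolating, and the distance matrix needs memory proportional to n_obs × n_model (about 9 million floats for 3000 rows).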