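I frequently need to query pandas DataFrames, and I am looking for the most efficient way to run these queries. I work with linear referencing systems (highways). Most road attributes are stored as linear events, indexed by route ID plus begin and end mileposts. What I need is to look up the road attributes at specific mileposts. Here is an example DataFrame that stores road attributes: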
import pandas as pd
idx = pd.MultiIndex.from_arrays(
[pd.Index(['FC','FC','FC','FC','OWNER','OWNER','OWNER','OWNER']),
pd.Index(['RID1','RID1','RID2','RID2','RID1','RID1','RID2','RID2']),
pd.IntervalIndex.from_arrays([0,1,10,11,0,1,10,11],
[1,2,11,12,1,2,11,12])])
idx.names = ['Item','RID','MP']
df = pd.DataFrame({'Value':[1,2,3,4,5,6,7,8]})
df.index = idx
Out[24]:
                     Value
Item  RID  MP
FC    RID1 (0, 1]        1
           (1, 2]        2
      RID2 (10, 11]      3
           (11, 12]      4
OWNER RID1 (0, 1]        5
           (1, 2]        6
      RID2 (10, 11]      7
           (11, 12]      8
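To make the intended lookup semantics concrete, here is a minimal sketch of a point-in-interval lookup for a single route, assuming right-closed intervals as in the frame above. The names `mileposts` and `values` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical single-route example: two right-closed milepost intervals.
mileposts = pd.IntervalIndex.from_arrays([0.0, 1.0], [1.0, 2.0])
values = pd.Series([1, 2], index=mileposts)

# A scalar lookup returns the value of the interval containing the milepost.
single = values.loc[0.2]

# get_indexer maps many mileposts to interval positions at once;
# -1 marks a milepost that falls in no interval.
positions = mileposts.get_indexer([0.2, 1.5, 1.0, 5.0])
```

Because the intervals are closed on the right, milepost 1.0 belongs to (0, 1] rather than (1, 2].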
And here is an example of the query DataFrame:
query_df = pd.DataFrame({
'Item':['FC' ,'OWNER','FC' ,'OWNER','OWNER'],
'RID' :['RID1','RID1' ,'RID1','RID2' ,'RID2' ],
'MP' :[0.2 ,1.5 ,1.6 ,11.1 ,10.9 ]})
Out[26]:
    Item   RID    MP
0     FC  RID1   0.2
1  OWNER  RID1   1.5
2     FC  RID1   1.6
3  OWNER  RID2  11.1
4  OWNER  RID2  10.9
I tried two approaches:
1) First approach:
for i in range(10000):  # repeated only to measure timing
    query_df['Value'] = query_df.apply(lambda r: df.Value.loc[r.Item].loc[r.RID].loc[r.MP], axis=1)
Wall time: 1min 18s
2) Second approach:
df = df.sort_index()
idx = pd.MultiIndex.from_arrays([query_df.Item,query_df.RID,query_df.MP])
query_df.index = idx
query_df = query_df.sort_index()
for i in range(10000):  # repeated only to measure timing
    query_df['Value'] = df.Value.loc[idx]
Wall time: 16.8s
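Although the second approach performs well, it does not work when the interval bounds are floats rather than integers. For example, if I change df to the following: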
Out[24]:
                     Value
Item  RID  MP
FC    RID1 (0, 1.2]      1
           (1.2, 2]      2
      RID2 (10, 11]      3
           (11, 12]      4
OWNER RID1 (0, 1]        5
           (1, 2]        6
      RID2 (10, 11]      7
           (11, 12]      8
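For reproducibility, the modified frame can be built along these lines. This is a sketch that mirrors the original construction; the names `left`, `right`, `idx_float` and `df_float` are introduced here for illustration and are not part of the original code. Note that pandas may display the bounds as `(0.0, 1.2]` once they are floats.

```python
import pandas as pd

# Same layout as the original df, except FC/RID1 is split at 1.2 instead of 1.
left = [0.0, 1.2, 10.0, 11.0, 0.0, 1.0, 10.0, 11.0]
right = [1.2, 2.0, 11.0, 12.0, 1.0, 2.0, 11.0, 12.0]

idx_float = pd.MultiIndex.from_arrays(
    [pd.Index(['FC', 'FC', 'FC', 'FC', 'OWNER', 'OWNER', 'OWNER', 'OWNER']),
     pd.Index(['RID1', 'RID1', 'RID2', 'RID2', 'RID1', 'RID1', 'RID2', 'RID2']),
     pd.IntervalIndex.from_arrays(left, right)],
    names=['Item', 'RID', 'MP'])

df_float = pd.DataFrame({'Value': [1, 2, 3, 4, 5, 6, 7, 8]}, index=idx_float)
```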
The first approach still works, but the second approach raises the following error:
ValueError: setting an array element with a sequence.
The correct (expected) query result:
query_df
Out[18]:
    Item   RID    MP  Value
0     FC  RID1   0.2      1
1  OWNER  RID1   1.5      6
2     FC  RID1   1.6      2
3  OWNER  RID2  11.1      8
4  OWNER  RID2  10.9      7
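As a cross-check of the expected result with float bounds, one possible vectorized reference is to store the interval bounds as plain columns, merge on Item and RID, and keep the rows whose right-closed interval contains the milepost. This is only a sketch and has not been benchmarked against the two approaches above; the names `events`, `queries`, `MP_from` and `MP_to` are hypothetical.

```python
import pandas as pd

# Road attributes with explicit (float) interval bounds instead of an IntervalIndex.
events = pd.DataFrame({
    'Item': ['FC', 'FC', 'FC', 'FC', 'OWNER', 'OWNER', 'OWNER', 'OWNER'],
    'RID': ['RID1', 'RID1', 'RID2', 'RID2', 'RID1', 'RID1', 'RID2', 'RID2'],
    'MP_from': [0.0, 1.2, 10.0, 11.0, 0.0, 1.0, 10.0, 11.0],
    'MP_to': [1.2, 2.0, 11.0, 12.0, 1.0, 2.0, 11.0, 12.0],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8]})

queries = pd.DataFrame({
    'Item': ['FC', 'OWNER', 'FC', 'OWNER', 'OWNER'],
    'RID': ['RID1', 'RID1', 'RID1', 'RID2', 'RID2'],
    'MP': [0.2, 1.5, 1.6, 11.1, 10.9]})

# Pair every query with all events on the same Item/RID, remembering the query row.
merged = queries.reset_index().merge(events, on=['Item', 'RID'], how='left')

# Keep only the event whose right-closed interval (MP_from, MP_to] contains MP.
hit = merged[(merged['MP'] > merged['MP_from']) & (merged['MP'] <= merged['MP_to'])]

# Align the matched values back onto the original query rows.
result = queries.assign(Value=hit.set_index('index')['Value'])
```

The intermediate merge holds one row per query and candidate event on the same route, so its memory use grows with the number of events per route.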
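Any ideas on how to make the second approach work with float intervals, or on any other approach that works with float intervals and runs in a time comparable to the second approach?
Versions: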
print(pd.__version__)
0.23.4
print(sys.version)
3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]