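I have a dictionary with information about individual positions: position_info
and information about features: feature_info
I need to find which feature (there can be more than one) each position falls in, so that I can annotate the positions. This is what I use now: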
feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[73, 84, 'e']]
position_info = {5:'some info', 16:'some other info', 75:'last info'}
for pos in position_info.keys():
    for info in feature_info:
        if info[0] <= pos < info[1]:
            print(pos, position_info[pos], info[2])
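The problem is that feature_info
contains 800k+ features and position_info
contains 150k positions, so this is very slow. I could optimize it a bit myself, but there are probably already better methods than mine that I just haven't found yet.
For example, this is one way I can think of to speed things up: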
for info in feature_info:
    for pos in position_info.keys():
        if info[0] <= pos < info[1]:
            print(pos, position_info[pos], info[2])
        if pos > info[1]:
            break
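If I sort the positions, I can break as soon as a position is greater than the end of the feature (provided I make sure those are sorted too). But there must be a better way to do this.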
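The sorting idea can be pushed one step further: once the positions are sorted, two binary searches per feature locate the slice of positions that fall inside it, so no position outside the feature is scanned at all. The following is a minimal sketch under that assumption, using the sample data from the question; the names `positions` and `matches` are made up for illustration.

```python
from bisect import bisect_left

feature_info = [[1, 10, 'a'], [15, 30, 'b'], [40, 60, 'c'], [55, 71, 'd'], [73, 84, 'e']]
position_info = {5: 'some info', 16: 'some other info', 75: 'last info'}

# Sort the position keys once so that binary search can be used on them.
positions = sorted(position_info)

matches = []
for start, end, name in feature_info:
    lo = bisect_left(positions, start)  # index of the first position >= start
    hi = bisect_left(positions, end)    # index of the first position >= end (end is exclusive)
    for pos in positions[lo:hi]:
        matches.append((pos, position_info[pos], name))

for match in matches:
    print(*match)
```

Because every feature is looked up independently, overlapping features are handled naturally: a position that lies in several features simply appears in several slices.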
What is the fastest way to achieve this?
import timeit
setup = """
from bisect import bisect
import pandas as pd
import random
import numpy as np
position_info = {}
random_number = random.sample(range(9000), 8000)
random_feature_start = random.sample(range(90000), 5000)
random_feature_length = np.random.choice(1000, 5000, replace=True)
for i in random_number:
    position_info[i] = 'test'
feature_info = []
for index, i in enumerate(random_feature_start):
    feature_info.append([i, i+random_feature_length[index],'test'])
"""
p1 = """
sections = sorted(r for a, b, c in feature_info for r in (a,b))
for pos in position_info:
    feature_info[int(bisect(sections, pos) / 2)]
"""
p2 = """
# feature info to dataframe
feature_df = pd.DataFrame(feature_info)
# rename feature df columns
feature_df.rename(index=str, columns={0: "start", 1: "end",2:'name'}, inplace=True)
# positions to dataframe
position_df = pd.DataFrame.from_dict(position_info, orient='index')
position_df['key'] = position_df.index
# merge dataframes
feature_df['merge'] = 1
position_df['merge'] = 1
merge_df = feature_df.merge(position_df, on='merge')
merge_df.drop(['merge'], inplace=True, axis=1)
# filter where key between start and end
merge_df = merge_df.loc[(merge_df.key > merge_df.start) & (merge_df.key < merge_df.end)]
"""
p3 = """
feature_df = pd.DataFrame(feature_info)
position_df = pd.DataFrame(position_info, index=[0])
hits = position_df.apply(lambda col: (feature_df [0] <= col.name) & (col.name < feature_df [1])).values.nonzero()
for f, p in zip(*hits):
    position_info[position_df.columns[p]]
    feature_info[f]
"""
print('bisect:',timeit.timeit(p1, setup=setup, number = 3))
print('panda method 1:',timeit.timeit(p2, setup=setup, number = 3))
print('panda method 2:',timeit.timeit(p3, setup=setup, number = 3))
bisect: 0.08317881799985116
panda method 1: 29.6151025639997
panda method 2: 16.90901438500032
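However, the bisect method only works when there are no overlapping features. For example,
feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[2, 8, 'a_new']]
does not work with it, whereas it does work with the pandas solutions.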
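To make the failure concrete, here is a small check that compares the flattened-boundary bisect lookup from the benchmark with the original nested-loop condition on the overlapping example. The variable names `bisect_hit` and `expected` are illustrative only.

```python
from bisect import bisect

feature_info = [[1, 10, 'a'], [15, 30, 'b'], [40, 60, 'c'], [55, 71, 'd'], [2, 8, 'a_new']]
pos = 5

# Flattened, sorted start/end boundaries, as in the bisect benchmark.
sections = sorted(r for a, b, c in feature_info for r in (a, b))
bisect_hit = feature_info[bisect(sections, pos) // 2]

# Reference result using the condition from the original nested loops.
expected = [name for start, end, name in feature_info if start <= pos < end]

print(bisect_hit)  # the single feature the bisect lookup returns
print(expected)    # the features that really contain pos
```

With overlapping features the sorted boundaries no longer pair up as (start, end) of the same feature, so halving the index points at an unrelated entry.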
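Answer 0 (score: 1)
The bisect
library and function are great for this kind of thing.
We basically create a sorted list of the feature range boundaries. If you need additional logic to check whether a position actually falls inside a feature range, let me know.
Since each feature_info[n][0:2]
is a range of 2 values (start and end), we need to divide the bisect result (which is an index position) by 2.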
from bisect import bisect
feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[73, 84, 'e']]
position_info = {5:'some info', 16:'some other info', 75:'last info'}
sections = sorted(r for a, b, c in feature_info for r in (a,b))
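for pos in position_info:
    # integer division: a float cannot be used as a list index in Python 3
    print(pos, feature_info[bisect(sections, pos) // 2])
This prints the following under Python 3 (you should be able to get all the information you need from it, but I wanted to show the basic result):
5 [1, 10, 'a']
16 [15, 30, 'b']
75 [73, 84, 'e']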
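The lookup above returns a feature even for a position that lies in a gap between two features. A hedged sketch of the extra check follows, assuming the features do not overlap: bisect on the sorted start values only, then confirm that the position is below the candidate's end. The helper `find_feature` and the sample data (including the gap position 12) are hypothetical, not part of the original answer.

```python
from bisect import bisect_right

def find_feature(pos, features_sorted, starts):
    """Return the feature containing pos, or None (assumes non-overlapping features)."""
    i = bisect_right(starts, pos) - 1  # last feature whose start is <= pos
    if i >= 0 and features_sorted[i][0] <= pos < features_sorted[i][1]:
        return features_sorted[i]
    return None

# Hypothetical non-overlapping sample data.
feature_info = [[40, 60, 'c'], [1, 10, 'a'], [73, 84, 'e'], [15, 30, 'b']]
position_info = {5: 'some info', 12: 'in a gap', 16: 'some other info', 75: 'last info'}

features_sorted = sorted(feature_info, key=lambda f: f[0])
starts = [f[0] for f in features_sorted]

annotated = {pos: find_feature(pos, features_sorted, starts) for pos in position_info}
for pos, feature in annotated.items():
    print(pos, feature)
```

Each lookup is a single binary search, so the total cost is roughly the sort plus one O(log M) search per position.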
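Answer 1 (score: 1)
Is a description in words okay?
Preprocessing:
triples of (index,
start/end,
{{1}}). Sort this list by index. Algorithm (two nested for loops):
Note that:
This will be fast because you never need to look at any position or any feature twice across the two loops. If overlaps are rare, it actually gets close to O(N + M) complexity (because the current_features set then stays small).
I assume there are no duplicate positions; handling those adds some complexity, but the general approach still works.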
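One way to read this description is as a sweep over sorted events. The sketch below is a reconstruction under assumptions, not the answerer's code: it merges feature starts, feature ends, and positions into one sorted event list (instead of two nested loops) and keeps the currently open features in a `current_features` set. The function name `annotate_positions` and the event encoding are made up for illustration, and the sort itself adds an O((N + M) log(N + M)) term.

```python
def annotate_positions(position_info, feature_info):
    """Map each position to the names of all features with start <= pos < end."""
    # Tie-breaking order at equal index: starts open first, then ends close
    # (end is exclusive), and positions are queried last.
    START, END, POSITION = 0, 1, 2
    events = []
    for i, (start, end, _name) in enumerate(feature_info):
        # Assumes start <= end for every feature.
        events.append((start, START, i))
        events.append((end, END, i))
    for pos in position_info:
        events.append((pos, POSITION, pos))
    events.sort()

    current_features = set()  # indices of features open at the sweep point
    hits = {}
    for _index, kind, payload in events:
        if kind == START:
            current_features.add(payload)
        elif kind == END:
            current_features.discard(payload)
        else:
            hits[payload] = sorted(feature_info[i][2] for i in current_features)
    return hits

feature_info = [[1, 10, 'a'], [15, 30, 'b'], [40, 60, 'c'], [55, 71, 'd'], [2, 8, 'a_new']]
position_info = {5: 'some info', 16: 'some other info', 57: 'overlap', 75: 'last info'}
hits = annotate_positions(position_info, feature_info)
for pos, names in hits.items():
    print(pos, position_info[pos], names)
```

Since every event is visited once and only the open features are touched per position, overlapping features are reported together without rescanning the feature list.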
Answer 2 (score: 1)
The fastest way is probably to use a fast library: pandas. Pandas vectorizes your operations to make them fast.
import pandas as pd
feature_df = pd.DataFrame(feature_info)
position_df = pd.DataFrame(position_info, index=[0])
# boolean matrix: rows are features, columns are positions
hits = position_df.apply(lambda col: (feature_df[0] <= col.name) & (col.name < feature_df[1])).values.nonzero()
# f is the feature row index, p is the position column index
for f, p in zip(*hits):
    print(position_info[position_df.columns[p]], "between", feature_info[f])
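Answer 3 (score: 1)
This also uses pandas. First convert both to DataFrames, then merge them, then keep only the rows where the position_info key lies between the feature start and end columns.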
import pandas as pd
feature_info = [[1, 10, 'a'],[15, 30, 'b'],[40, 60, 'c'],[55, 71, 'd'],[73, 84, 'e']]
position_info = {5:'some info', 16:'some other info', 75:'last info'}
# feature info to dataframe
feature_df = pd.DataFrame(feature_info)
# rename feature df columns
feature_df.rename(index=str, columns={0: "start", 1: "end",2:'name'}, inplace=True)
# positions to dataframe
position_df = pd.DataFrame.from_dict(position_info, orient='index')
position_df['key'] = position_df.index
# merge dataframes
feature_df['merge'] = 1
position_df['merge'] = 1
merge_df = feature_df.merge(position_df, on='merge')
merge_df.drop(['merge'], inplace=True, axis=1)
# filter where start <= key < end, matching the condition used in the question
merge_df = merge_df.loc[(merge_df.key >= merge_df.start) & (merge_df.key < merge_df.end)]