我创建了一个字典myDict
,其中包含以下表格中的1000万个条目。字典中的每个条目代表{(id, age): code}
>>> myDict = {('1039', '68.0864'): '42731,42781,V4501',
('1039', '68.1704'): '4770,4778,V071',
('0845', '60.4476'): '2724,27800,4019',
('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
}
常量ageOffset
定义为值= 0.1
给定(id,age)
元组,如何从myDict
获取具有键(id, X)
的所有值,其中:
age <= X <= age+ageOffset
我需要执行200次这种获取操作。
Examples:
1.
myTup = ('1039', '68.0')
the answer is: '42731,42781,V4501'
2.
myTup = ('0845', '60.0')
Ans : No value returned
编辑: 我可以在Key的第一个元素的部分匹配的基础上创建一个子字典。我的意思是,如果元组Key的第一个元素匹配,则创建一个子字典。根据我的数据,这不会超过几百。然后执行线性范围搜索,比较元组键中的第二个元素并找到相应的值。
答案 0 :(得分:3)
要执行此操作200亿(!)次,您必须稍微预处理数据。
首先,我会按ID分组:
def preprocess(data):
from collections import defaultdict # Python 2.5+ only
preprocessed = defaultdict(list)
# group by id
for (id, age), value in data.iteritems():
preprocessed[id].append((float(age), value))
# sort lists for binary search, see edit
for key, value in preprocessed.iteritems():
value.sort()
return preprocessed
结果应如下所示:
>>> preprocess(myDict)
defaultdict(<type 'list'>, {
'0845': [(60.4476, '2724,27800,4019')],
'0983': [(63.3936, '41401,4168,4240,V1582,V7281')],
'1039': [(68.0864, '42731,42781,V4501'), (68.1704, '4770,4778,V071')]}
如果相对较少的项目共享相同的ID,从而导致短列表,您可能会过滤列表。
def lookup(data, id, age, age_offset=0.1):
if id in data:
return [value for x, value in data[id] if age <= x <= age+age_offset]
else:
return None
lookup(preprocessed, '1039', 68.0) # Note that I use floats for age
['42731,42781,V4501']
但是,如果许多项共享相同的ID,则必须遍历长列表,使查找相对较慢。在这种情况下,您将不得不应用进一步的优化。
编辑:正如@Andrey Petrov所建议的
from bisect import bisect_left
from itertools import islice, takewhile
def optimized_lookup(data, id, age, age_offset=0.1):
if id in data:
l = data[id]
idx = bisect_left(l, age)
return [a for a,v in takewhile(lambda (x, value): x <= age+age_offset, islice(l, idx, None))]
else:
return None
答案 1 :(得分:1)
这是一种在numpy中实现的方法,虽然我还没有测试过,但我相信它会比循环遍历字典快得多。我用Numpy记录数组替换了字典结构,并使用np.where来定位它们与您给出的参数匹配的行。
import numpy as np
myDict = {('1039', '68.0864'): '42731,42781,V4501',
('1039', '68.1704'): '4770,4778,V071',
('0845', '60.4476'): '2724,27800,4019',
('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
}
records=[]
for k,v in myDict.iteritems():
records.append([k[0], float(k[1]), v])
myArr = np.rec.fromrecords(records, formats='S10, f4, S100',
names="ID, Age, Code")
def findInMyArray(arr, requestedID, requestedAge, tolerance=0.1):
idx = np.where(((arr["Age"] - requestedAge) < tolerance) & (arr["ID"] == requestedID))
return idx
idx = findInMyArray(myArr, "1039", 68.0, tolerance=0.1)
print "The index found is: ", idx
print "The values are: ", myArr["Code"][idx[0]]
答案 2 :(得分:0)
def getr(t):
id = float(t[0])
age = float(t[1])
os = 0.1
rs = []
correct_id=fixed[id]
for k in correct_id.keys():
if (k > age and k <= age + os):
rs.append(correct_id.get(k))
return rs
ct = {('1039', '68.0864'): '42731,42781,V4501',
('1039', '68.1704'): '4770,4778,V071',
('0845', '60.4476'): '2724,27800,4019',
('0983', '63.3936'): '41401,4168,4240,V1582,V7281' }
fixed={}
for k in ct:
if not(float(k[0]) in fixed):
fixed[float(k[0])]={}
fixed[float(k[0])][float(k[1])] = ct[k]
print "1"
myTup = ('1039', '68.0')
assert(getr(myTup) == ['42731,42781,V4501'])
#the answer is: '42731,42781,V4501'
print "2"
myTup = ('0845', '60.0')
assert(getr(myTup) == [])
#Ans : No value returned