Getting values from a dictionary: range matching on keys

Time: 2013-02-20 15:09:13

Tags: python dictionary

I have created a dictionary myDict that holds 10 million entries in the form shown below. Each entry in the dictionary represents {(id, age): code}

>>> myDict = {('1039', '68.0864'): '42731,42781,V4501', 
              ('1039', '68.1704'): '4770,4778,V071', 
              ('0845', '60.4476'): '2724,27800,4019', 
              ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
             }

A constant ageOffset is defined with the value 0.1

Given an (id, age) tuple, how can I fetch from myDict all the values whose key is (id, X), where:

age <= X <= age+ageOffset 

I need to perform this fetch operation 20 billion times.

Examples: 
1. 
myTup = ('1039', '68.0')
the answer is: '42731,42781,V4501'

2. 
myTup = ('0845', '60.0')
Ans : No value returned 

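Edit: I can create a sub-dictionary based on a partial match on the first element of the key. That is, the entries whose tuple key has a matching first element go into one sub-dictionary. Based on my data, that will not be more than a few hundred entries. I can then do a linear range search, comparing against the second element of the tuple key, and pick out the corresponding values.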
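A minimal sketch of the approach described in this edit, assuming Python 3 and ages that parse as floats. The helper names `build_sub_dicts` and `range_search` are made up for illustration and do not come from the original post.

```python
from collections import defaultdict

AGE_OFFSET = 0.1  # the ageOffset constant from the question


def build_sub_dicts(data):
    """Group {(id, age): code} into {id: {age_as_float: code}}."""
    sub_dicts = defaultdict(dict)
    for (id_, age), code in data.items():
        sub_dicts[id_][float(age)] = code
    return sub_dicts


def range_search(sub_dicts, tup, offset=AGE_OFFSET):
    """Linear scan over one id's sub-dictionary: age <= X <= age + offset."""
    id_, age = tup[0], float(tup[1])
    # .get() avoids creating an empty entry for an unknown id
    candidates = sub_dicts.get(id_, {})
    return [code for x, code in candidates.items() if age <= x <= age + offset]


myDict = {('1039', '68.0864'): '42731,42781,V4501',
          ('1039', '68.1704'): '4770,4778,V071',
          ('0845', '60.4476'): '2724,27800,4019',
          ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'}

sub_dicts = build_sub_dicts(myDict)
print(range_search(sub_dicts, ('1039', '68.0')))  # ['42731,42781,V4501']
print(range_search(sub_dicts, ('0845', '60.0')))  # []
```

With only a few hundred entries per id, each lookup costs one hash lookup plus a short scan, so the grouping step is paid once up front.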

3 answers:

Answer 0 (score: 3)

To do this 20 billion (!) times, you will have to preprocess your data a bit.

First, I would group by id:

from collections import defaultdict # Python 2.5+ only

def preprocess(data):
    preprocessed = defaultdict(list)
    # group by id
    for (id, age), value in data.items():  # items() works on Python 2 and 3
        preprocessed[id].append((float(age), value))
    # sort each list by age for binary search, see edit
    for value in preprocessed.values():
        value.sort()
    return preprocessed

The result should look like this:

>>> preprocess(myDict)
defaultdict(<type 'list'>, {
    '0845': [(60.4476, '2724,27800,4019')],
    '0983': [(63.3936, '41401,4168,4240,V1582,V7281')],
    '1039': [(68.0864, '42731,42781,V4501'), (68.1704, '4770,4778,V071')]})

If relatively few items share the same id, so that the lists stay short, you can get away with simply filtering the list.

def lookup(data, id, age, age_offset=0.1):
    if id in data:
        return [value for x, value in data[id] if age <= x <= age+age_offset]
    else:
        return None     

>>> preprocessed = preprocess(myDict)
>>> lookup(preprocessed, '1039', 68.0) # Note that I use floats for age
['42731,42781,V4501']

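However, if many items share the same id, every lookup has to walk a long list, which makes it relatively slow. In that case you will have to apply further optimizations.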

Edit: as suggested by @Andrey Petrov:

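from bisect import bisect_left
from itertools import islice, takewhile
def optimized_lookup(data, id, age, age_offset=0.1):
    if id in data:
        entries = data[id]
        # (age,) sorts before every (age, value), so idx is the first entry with x >= age
        idx = bisect_left(entries, (age,))
        # walk forward only while the age is still inside the window
        in_range = takewhile(lambda item: item[0] <= age + age_offset,
                             islice(entries, idx, None))
        return [value for x, value in in_range]
    else:
        return None 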
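As a variation on the binary search above, both ends of the age window can be located with `bisect`, so the matching values come out as a single slice. This is only a sketch, assuming Python 3; `build_index` and `window_lookup` are illustrative names that do not appear in the answer, and the ages are stored in a separate list so that plain floats can be bisected.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(data):
    """Return {id: (ages sorted ascending, values in the same order)}."""
    grouped = defaultdict(list)
    for (id_, age), value in data.items():
        grouped[id_].append((float(age), value))
    index = {}
    for id_, pairs in grouped.items():
        pairs.sort()
        ages = [age for age, _ in pairs]
        values = [value for _, value in pairs]
        index[id_] = (ages, values)
    return index


def window_lookup(index, id_, age, age_offset=0.1):
    """All values with age <= x <= age + age_offset, or None for an unknown id."""
    if id_ not in index:
        return None
    ages, values = index[id_]
    lo = bisect_left(ages, age)                # first x >= age
    hi = bisect_right(ages, age + age_offset)  # one past the last x <= age + age_offset
    return values[lo:hi]


myDict = {('1039', '68.0864'): '42731,42781,V4501',
          ('1039', '68.1704'): '4770,4778,V071',
          ('0845', '60.4476'): '2724,27800,4019',
          ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'}

index = build_index(myDict)
print(window_lookup(index, '1039', 68.0))  # ['42731,42781,V4501']
print(window_lookup(index, '0845', 60.0))  # []
```

Each lookup then costs two binary searches plus the size of the result. Ages that sit exactly on the upper bound remain subject to floating-point rounding in `age + age_offset`.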

Answer 1 (score: 1)

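Here is a way to do it in numpy. I have not tested it, but I am fairly confident it will be much faster than looping over the dictionary. I replaced the dictionary structure with a NumPy record array and used np.where to locate the rows that match the parameters you give.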

import numpy as np

myDict = {('1039', '68.0864'): '42731,42781,V4501', 
              ('1039', '68.1704'): '4770,4778,V071', 
              ('0845', '60.4476'): '2724,27800,4019', 
              ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'
             }

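records=[]
for k,v in myDict.items():
    # fromrecords expects a list of tuples
    records.append((k[0], float(k[1]), v))

# U (unicode) fields so the ID compares equal to a str on Python 3; f8 keeps full float precision
myArr = np.rec.fromrecords(records, formats='U10, f8, U100', 
                             names="ID, Age, Code")

def findInMyArray(arr, requestedID, requestedAge, tolerance=0.1):
    # keep only rows whose age lies in [requestedAge, requestedAge + tolerance]
    diff = arr["Age"] - requestedAge
    idx = np.where((diff >= 0) & (diff <= tolerance) & (arr["ID"] == requestedID))
    return idx

idx = findInMyArray(myArr, "1039", 68.0, tolerance=0.1)
print("The index found is: %s" % (idx,))
print("The values are: %s" % (myArr["Code"][idx[0]],))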
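The np.where call above scans every row on each query. One possible alternative, sketched here under the assumption of Python 3, keeps a sorted age array per id and uses `np.searchsorted`, which also accepts a whole batch of query ages at once. This swaps the record array for a plain dict of arrays; `build_arrays` and `find_many` are hypothetical names, not part of the answer.

```python
from collections import defaultdict

import numpy as np


def build_arrays(data):
    """Return {id: (ages sorted ascending, codes in the same order)} as NumPy arrays."""
    grouped = defaultdict(list)
    for (id_, age), code in data.items():
        grouped[id_].append((float(age), code))
    arrays = {}
    for id_, pairs in grouped.items():
        pairs.sort()
        ages = np.array([age for age, _ in pairs], dtype=np.float64)
        codes = np.array([code for _, code in pairs], dtype=object)
        arrays[id_] = (ages, codes)
    return arrays


def find_many(arrays, requested_id, requested_ages, tolerance=0.1):
    """For each requested age, list the codes with age <= x <= age + tolerance."""
    if requested_id not in arrays:
        return [[] for _ in requested_ages]
    ages, codes = arrays[requested_id]
    queries = np.asarray(requested_ages, dtype=np.float64)
    lo = np.searchsorted(ages, queries, side="left")               # first x >= age
    hi = np.searchsorted(ages, queries + tolerance, side="right")  # past the last x <= age + tolerance
    return [list(codes[a:b]) for a, b in zip(lo, hi)]


myDict = {('1039', '68.0864'): '42731,42781,V4501',
          ('1039', '68.1704'): '4770,4778,V071',
          ('0845', '60.4476'): '2724,27800,4019',
          ('0983', '63.3936'): '41401,4168,4240,V1582,V7281'}

arrays = build_arrays(myDict)
results = find_many(arrays, '1039', [68.0, 68.1, 70.0])
print(results)
```

Whether this beats the pure-Python bisect version depends on how many ages are queried per id in one call; for single lookups the NumPy call overhead may dominate.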

Answer 2 (score: 0)

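def getr(t):
  id = float(t[0])
  age = float(t[1])
  offset = 0.1
  rs = []
  # an unknown id yields an empty result instead of a KeyError
  correct_id = fixed.get(id, {})
  for k in correct_id:
      # inclusive on both ends, as in the question: age <= k <= age + offset
      if age <= k <= age + offset:
          rs.append(correct_id[k])
  return rs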

ct = {('1039', '68.0864'): '42731,42781,V4501',
      ('1039', '68.1704'): '4770,4778,V071',
      ('0845', '60.4476'): '2724,27800,4019',
      ('0983', '63.3936'): '41401,4168,4240,V1582,V7281' }

fixed={}

for k in ct:
    if not(float(k[0]) in fixed):
        fixed[float(k[0])]={}
    fixed[float(k[0])][float(k[1])] = ct[k]

print("1")
myTup = ('1039', '68.0')
assert(getr(myTup) == ['42731,42781,V4501'])

#the answer is: '42731,42781,V4501'

print("2")
myTup = ('0845', '60.0')
assert(getr(myTup) == [])
#Ans : No value returned