Comparing the list of values associated with a key in a dictionary against the values of all other keys

Date: 2014-07-12 03:20:37

Tags: python dictionary

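Here is the code I wrote to compare the list of values associated with each key against those of all other keys in the dictionary, but it takes a very long time to process about 10,000 records from a CSV file. Can anybody help optimize the code so it runs in as little time as possible? Don't worry about the external function call; it works fine.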

import csv
import sys
file = sys.argv[1]
with open(file, 'rU') as inf:
    csvreader=csv.DictReader(inf,delimiter=',')
    result={}
    temp = []
#Creating Dict
    for r in csvreader:
        name=[]
        name.append(r['FIRST_NAME'])
        name.append(r['LAST_NAME'])
        name.append(r['ID'])
        result.setdefault(r['GROUP_KEY'],[]).append(name) 

#Processing the Dict

for key1 in result.keys():
    temp.append(key1)
    for key2 in result.keys():
        if key1 != key2 and key2 not in temp:  # skip keys already processed
            for v1 in result[key1]:
                for v2 in result[key2]:
                    score=name_match_score(v1,'',v2,'')[0] ####calling external function
                    if score > 0.90:
                        print v1[2],v2[2],score

1 Answer:

Answer 0 (score: 0)

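Something like this should help. The goal is to reduce the number of raw computations done in name_match_score by skipping redundant calculations and caching results.

First, store your dictionary as a defaultdict of lists of tuples. Tuples are immutable, so they can be used as keys in the sets and dicts below.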

from collections import defaultdict
import csv
import sys

file = sys.argv[1]
with open(file, 'rU') as inf:
    csvreader=csv.DictReader(inf, delimiter=',')
    result = defaultdict(list)
    for r in csvreader:
        name = (r['FIRST_NAME'], r['LAST_NAME'], r['ID'])
        result[r['GROUP_KEY']].append(name)
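
As a small self-contained illustration of this grouping step, the sketch below reads a few made-up rows from an in-memory string instead of a file. The sample data and the helper name group_names are assumptions for illustration only, and the snippet uses Python 3 syntax.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data, standing in for the real CSV file.
SAMPLE = """GROUP_KEY,FIRST_NAME,LAST_NAME,ID
A,John,Smith,1
A,Jon,Smith,2
B,Jane,Doe,3
"""

def group_names(lines):
    """Group (first, last, id) tuples by GROUP_KEY."""
    result = defaultdict(list)
    for r in csv.DictReader(lines):
        name = (r['FIRST_NAME'], r['LAST_NAME'], r['ID'])
        result[r['GROUP_KEY']].append(name)
    return result

groups = group_names(io.StringIO(SAMPLE))
print(dict(groups))
```

Because every name is a tuple of strings, it is hashable and can later serve as part of a cache key.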

Next, sort the keys so that each pair of keys is evaluated only once.

keys = sorted(result)
for i, key1 in enumerate(keys):
    for key2 in keys[i+1:]:

Order v1 and v2 so that each pair forms a unique key. This helps with caching.

        for v1 in result[key1]:
            for v2 in result[key2]:
                # Use new names here: rebinding v1 would corrupt the outer loop variable.
                a, b = min(v1, v2), max(v1, v2)
                score=name_match_score(a, b)[0] ####calling external function
                if score > 0.90:
                    print a[2],b[2],score
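
The same pairing logic can also be written with itertools.combinations, which yields each unordered pair of keys exactly once. The sketch below is one possible way to express it, assuming Python 3; iter_name_pairs and the sample data are hypothetical, and the scoring call is left out.

```python
from itertools import combinations

def iter_name_pairs(result):
    """Yield each cross-group pair of names once, smaller tuple first."""
    for key1, key2 in combinations(sorted(result), 2):
        for v1 in result[key1]:
            for v2 in result[key2]:
                # Put the smaller tuple first so (x, y) and (y, x)
                # would map to the same cache key.
                yield (v1, v2) if v1 <= v2 else (v2, v1)

# Made-up data for illustration.
sample = {
    'A': [('John', 'Smith', '1')],
    'B': [('Jane', 'Doe', '3'), ('Zed', 'Young', '4')],
    'C': [('Ann', 'Lee', '5')],
}
pairs = list(iter_name_pairs(sample))
```

With three groups of sizes 1, 2 and 1, this produces 1*2 + 1*1 + 2*1 = 5 pairs, rather than the 10 visits a full double loop over the keys would make.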

Then use a memoizing decorator to cache the computation:

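# Imports needed by the decorator below.
# (On Python 3.3+, Hashable lives in collections.abc instead.)
import collections
import functools

class memoized(object):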
    '''Decorator. Caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned
    (not reevaluated).
    '''
    def __init__(self, func):
        self.func = func
        self.cache = {}
    def __call__(self, *args):
        if not isinstance(args, collections.Hashable):
            # uncacheable. a list, for instance.
            # better to not cache than blow up.
            return self.func(*args)
        if args in self.cache:
            return self.cache[args]
        else:
            value = self.func(*args)
            self.cache[args] = value
            return value
    def __repr__(self):
        '''Return the function's docstring.'''
        return self.func.__doc__
    def __get__(self, obj, objtype):
        '''Support instance methods.'''
        return functools.partial(self.__call__, obj)

And change name_match_score to use the decorator:

@memoized
def name_match_score(v1, v2):
    # Whatever this does
    return (0.75, )

This should minimize the number of raw computations performed inside name_match_score.
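
On Python 3, the standard library's functools.lru_cache can give a similar effect without a hand-written class. The sketch below is only a stand-in under stated assumptions: dummy_match_score is a placeholder built on difflib.SequenceMatcher, not the real name_match_score, and the calls counter exists purely to show the cache at work.

```python
import functools
from difflib import SequenceMatcher

calls = 0  # counts how often the underlying function really runs

@functools.lru_cache(maxsize=None)
def dummy_match_score(v1, v2):
    """Placeholder scorer; NOT the real name_match_score."""
    global calls
    calls += 1
    full1 = ' '.join(v1[:2])  # first and last name only
    full2 = ' '.join(v2[:2])
    return (SequenceMatcher(None, full1, full2).ratio(),)

a = ('John', 'Smith', '1')
b = ('Jon', 'Smith', '2')
first = dummy_match_score(a, b)
second = dummy_match_score(a, b)  # served from the cache
```

Note that the cache key here includes the ID field, so a hit requires the same two records to meet again. If the real score depends only on the names, caching on the name fields alone would likely hit more often.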