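Here is the code I wrote to compare the list of values associated with each key against the values of every other key in the dictionary. It takes a very long time to process about 10,000 records from a CSV file. Can anybody help optimize the code so that it runs in the shortest possible time? Don't worry about the external function call; it works fine.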
import csv
import sys

file = sys.argv[1]
result = {}
temp = []

# Creating Dict
with open(file, 'rU') as inf:
    csvreader = csv.DictReader(inf, delimiter=',')
    for r in csvreader:
        name = []
        name.append(r['FIRST_NAME'])
        name.append(r['LAST_NAME'])
        name.append(r['ID'])
        result.setdefault(r['GROUP_KEY'], []).append(name)

# Processing the Dict
for key1 in result.keys():
    temp.append(key1)
    for key2 in result.keys():
        # skip keys that have already been handled as key1 (tracked in temp)
        if key1 != key2 and key2 not in temp:
            for v1 in result[key1]:
                for v2 in result[key2]:
                    score = name_match_score(v1, '', v2, '')[0]  # calling external function
                    if score > 0.90:
                        print v1[2], v2[2], score
Answer 0 (score: 0)
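Something like this should help. The goal is to reduce the number of raw computations done inside name_match_score by skipping redundant comparisons and caching results that have already been computed.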
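First, build your dictionary as a defaultdict that maps each key to a list of tuples. Tuples are immutable, so they can be used as keys in the sets and dicts below.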
from collections import defaultdict
import csv
import sys

file = sys.argv[1]
with open(file, 'rU') as inf:
    csvreader = csv.DictReader(inf, delimiter=',')
    result = defaultdict(list)
    for r in csvreader:
        name = (r['FIRST_NAME'], r['LAST_NAME'], r['ID'])
        result[r['GROUP_KEY']].append(name)
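As a quick check of the grouping step, here is a minimal Python 3 sketch that runs the same defaultdict logic on a small in-memory CSV. The sample rows and the helper name group_rows are made up for illustration; only the column names come from the question.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; only the column names come from the question.
SAMPLE = """GROUP_KEY,FIRST_NAME,LAST_NAME,ID
g1,John,Smith,1
g1,Jon,Smith,2
g2,Jane,Doe,3
"""

def group_rows(lines):
    """Group (first, last, id) tuples by GROUP_KEY."""
    result = defaultdict(list)
    for r in csv.DictReader(lines, delimiter=','):
        result[r['GROUP_KEY']].append(
            (r['FIRST_NAME'], r['LAST_NAME'], r['ID']))
    return result

groups = group_rows(io.StringIO(SAMPLE))
print(dict(groups))
```

Each record ends up as a hashable tuple, which is what the caching step later relies on.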
Then sort the keys and loop so that each pair of keys is evaluated only once.
keys = sorted(result)
for i, key1 in enumerate(keys):
    for key2 in keys[i+1:]:
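The slicing loop above visits each unordered pair of keys exactly once. The standard library's itertools.combinations yields the same pairs without copying part of the list on every outer iteration. A small Python 3 sketch with made-up keys:

```python
from itertools import combinations

keys = sorted(['g3', 'g1', 'g2'])  # hypothetical group keys

# Pairs produced by the enumerate/slice loop.
sliced = [(k1, k2) for i, k1 in enumerate(keys) for k2 in keys[i + 1:]]

# The same pairs from itertools.combinations.
combined = list(combinations(keys, 2))

print(combined)
```

For n keys this gives n*(n-1)/2 pairs instead of the n*(n-1) ordered pairs that a full double loop would visit.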
Put v1 and v2 into a fixed order so that each pair always forms the same unique key. This helps with caching.
        for v1 in result[key1]:
            for v2 in result[key2]:
                # Use new names so the outer loop variable v1 is not overwritten
                lo, hi = min(v1, v2), max(v1, v2)
                score = name_match_score(lo, hi)[0]  # calling external function
                if score > 0.90:
                    print lo[2], hi[2], score
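To see why the ordering matters for caching: tuples compare element by element, so min and max put any two records into the same order no matter which way round they arrive. A Python 3 sketch with made-up records follows; canonical_pair is a hypothetical helper, not part of the code above.

```python
def canonical_pair(a, b):
    """Return the two records in a fixed order, regardless of argument order."""
    return (min(a, b), max(a, b))

x = ('Jon', 'Smith', '2')    # made-up records
y = ('John', 'Smith', '1')

# Both call orders produce the same pair, hence the same cache key.
print(canonical_pair(x, y))
print(canonical_pair(y, x))
```

This only pays off if the score is symmetric, that is, if comparing a to b gives the same result as comparing b to a, which this approach assumes.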
Then use a memoizing decorator to cache the computations:
import functools

class memoized(object):
    '''Decorator. Caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned
    (not reevaluated).
    '''
    def __init__(self, func):
        self.func = func
        self.cache = {}
    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            # first call with these arguments: compute and remember
            value = self.func(*args)
            self.cache[args] = value
            return value
        except TypeError:
            # uncacheable. a list, for instance.
            # better to not cache than blow up.
            return self.func(*args)
    def __repr__(self):
        '''Return the function's docstring.'''
        return self.func.__doc__
    def __get__(self, obj, objtype):
        '''Support instance methods.'''
        return functools.partial(self.__call__, obj)
And change name_match_score to use the decorator:
@memoized
def name_match_score(v1, v2):
    # Whatever this does
    return (0.75, )
This should minimize the number of raw computations performed inside name_match_score.
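On Python 3, the standard library's functools.lru_cache offers the same kind of memoization without a hand-written class. The sketch below uses a stand-in scorer built on difflib.SequenceMatcher, since the real name_match_score is not shown; the stand-in, its scoring rule, and the sample records are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from functools import lru_cache

@lru_cache(maxsize=None)
def name_match_score(v1, v2):
    """Stand-in scorer (assumption): similarity of the 'first last' strings."""
    a = ' '.join(v1[:2]).lower()
    b = ' '.join(v2[:2]).lower()
    return (SequenceMatcher(None, a, b).ratio(),)

a = ('John', 'Smith', '1')   # made-up records
b = ('John', 'Smith', '7')

first = name_match_score(a, b)[0]
second = name_match_score(a, b)[0]   # served from the cache
info = name_match_score.cache_info()
print(first, info.hits, info.misses)
```

Because the ID is part of each tuple, a cache hit requires the same two records to be compared again. If many records share the same names, passing only the name fields to the cached function (for example v1[:2] and v2[:2]) would likely produce more hits.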