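I have a multi-dimensional array that I am trying to feed into difflib.get_close_matches().
My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the ORIGINAL string with common words filtered out.
Currently I create a new array containing only the FILTERED words and feed that into difflib.get_close_matches(). I then try to match the result from difflib back to array[(ORIGINAL, FILTERED)]. My problem is that I will frequently have two or more identical FILTERED words, so I cannot match them back with this method.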
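To make the ambiguity concrete, here is a minimal sketch of one possible workaround: index the pairs by their FILTERED value so that a matched string maps back to every pair that produced it. The sample pairs and the names `pairs`, `by_filtered`, and `recovered` are hypothetical and are not taken from the script below. The sketch is written for Python 3, unlike the Python 2 script in the question.

```python
import difflib
from collections import defaultdict

# Hypothetical (ORIGINAL, FILTERED) pairs; two of them share one FILTERED value.
pairs = [
    ("acme university", "acme"),
    ("acme univ", "acme"),
    ("zenith labs", "zenith labs"),
]

# Index every pair by its FILTERED value, keeping all pairs that share it.
by_filtered = defaultdict(list)
for pair in pairs:
    by_filtered[pair[1]].append(pair)

# get_close_matches() only sees (and only returns) the FILTERED strings.
# Using the index keys as candidates means a repeated value appears once.
candidates = list(by_filtered)
matches = difflib.get_close_matches("acme", candidates, 1, 0.75)

# Map the matched string back to every pair it could have come from.
recovered = [pair for m in matches for pair in by_filtered[m]]
```

This does not decide which of the tied pairs is the "right" one; it only makes sure none of them is silently dropped.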
Is there a way I can feed the entire array[(ORIGINAL, FILTERED)] to difflib, but have it look only at the FILTERED part while returning the whole [(ORIGINAL, FILTERED)] entry?
Thanks in advance!
import time
import csv
import difflib
import sys
import os.path
import datetime
### Filters out common words in an attempt to get better results ###
def ignoredWords (word):
    filtered = word.lower()

    #Common Full Words
    ## Majority of filters were edited out

    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ","")

    #Special Characters
    if "  " in filtered: #Two White Spaces
        filtered = filtered.replace("  "," ")
    if "-" in filtered:
        filtered = filtered.replace("-"," ")
    if "\'" in filtered:
        filtered = filtered.replace("\'"," ")
    if " & " in filtered:
        filtered = filtered.replace(" &","")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"","")
    if "\")" in filtered:
        filtered = filtered.replace("\")","")
    if "\t" in filtered:
        filtered = filtered.replace("\t"," ")
    return filtered

### Takes in a list, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray (list):
    array = []
    for item in list:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0],item[1])
        array.append(entry)
    return array

def main(argv):
    if(len(argv) < 3):
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif (not os.path.isfile(argv[1])):
        print "%s is not found" %(argv[1])
        sys.exit(2)
    elif (not os.path.isfile(argv[2])):
        print "%s is not found" %(argv[2])
        sys.exit(2)

    #Recode File ----- Not yet implemented
    # if(len(argv) == 4):
    #     if(not os.path.isfile(argv[3])):
    #         print "%s is not found" %(argv[3])
    #         sys.exit(2)
    #
    # recode = open(argv[1], 'r')
    # try:
    #     setRecode = c.readlines()
    # finally:
    #     recode.close()
    # setRecode.sort()
    # print setRecode[0]

    #Measure execution time
    t0 = time.time()

    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()

    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()

    #Put Set A and Set C into their own 2-dimensional arrays. array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])
    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item,cleanListA,1,0.75)
            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None,item,match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                #Look the matched strings back up by value; this is where
                #duplicate filtered strings make the pairing ambiguous
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None,origC,origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)
                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                else:
                    row = ["NULL","NULL",item.rstrip(),match[0].rstrip(),"NULL","NULL","NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()
    print (time.time()-t0,"seconds")

if __name__ == "__main__":
    main(argv=sys.argv)

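What I am trying to accomplish:
Why this is difficult
The naming conventions used in the two input files vary greatly. Some names are partially abbreviated (e.g. File A: Acme Company; File B: Acme Co). Because the naming conventions are inconsistent, I cannot simply do 'FileA.intersect(FileB)', which would have been the ideal approach.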
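As a small illustration of why an exact intersection fails while a similarity ratio still carries signal, the sketch below compares the two example names. The strings are illustrative only, lower-cased the way the script lower-cases its input, and the variable names are hypothetical.

```python
import difflib

# Illustrative names only, one per input file.
names_a = {"acme company"}
names_b = {"acme co"}

# Exact set intersection finds nothing, because the strings differ.
common = names_a & names_b

# SequenceMatcher scores similarity as 2*M/T, where M is the number of
# matching characters and T is the combined length of both strings.
# Here the shared prefix "acme co" gives M = 7 and T = 12 + 7 = 19.
ratio = difflib.SequenceMatcher(None, "acme company", "acme co").ratio()
```

The ratio here is 14/19, roughly 0.74, which falls just short of the 0.75 cutoff used in the script. That is presumably why the common words are stripped before matching.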
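Where the change should be made
for item in cleanListC:
    match = difflib.get_close_matches(item,cleanListA,1,0.75)
cleanListA is created by:
cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])
which loses the (ORIGINAL, FILTERED) pairing.
End goal
I would like to feed arrayA to difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. When determining close matches, difflib.get_close_matches() would look only at the FILTERED part of each pair, but it would return the whole pair.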
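difflib.get_close_matches() has no parameter for comparing on one field and returning another, so the behaviour described above needs a small wrapper. The following is a minimal sketch modelled on how get_close_matches() works (SequenceMatcher plus heapq.nlargest). The function name `get_close_pairs`, its `key` parameter, and the sample records are assumptions made for illustration, and the sketch targets Python 3.

```python
import difflib
import heapq


def get_close_pairs(word, pairs, n=3, cutoff=0.6, key=lambda pair: pair[1]):
    """Score key(pair) against word, but return the whole pairs (best first)."""
    if n <= 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    scored = []
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is cached, so set the fixed word once
    for index, pair in enumerate(pairs):
        matcher.set_seq1(key(pair))
        # The two cheap upper bounds skip most full ratio() computations.
        if (matcher.real_quick_ratio() >= cutoff and
                matcher.quick_ratio() >= cutoff and
                matcher.ratio() >= cutoff):
            # -index breaks ties in favour of the earlier entry, so the
            # pairs themselves are never compared with each other.
            scored.append((matcher.ratio(), -index, pair))
    return [pair for _, _, pair in heapq.nlargest(n, scored)]


# Hypothetical records shaped like arrayA: (original, filtered, account, code).
array_a = [
    ("acme university", "acme", "1001", "X"),
    ("acme univ", "acme", "1002", "Y"),
    ("zenith labs", "zenith labs", "1003", "Z"),
]
best = get_close_pairs("acme", array_a, n=1, cutoff=0.75)
```

Because whole records come back, the account number and code travel with the match and no second lookup by string value is needed. Identical FILTERED values still tie; this sketch simply prefers the earlier record.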
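Answer 0: (score: 0)
Since you are already using SequenceMatcher directly to get the match ratios, the most straightforward change is probably to perform the get_close_matches operation yourself.
Compare the source of get_close_matches() [e.g. near line 737 of http://svn.python.org/view/python/tags/r271/Lib/difflib.py?revision=86833&view=markup]. It returns a list of the n sequences with the highest ratios. Since you only want the single best match, you can keep track of the (ORIGINAL, FILTERED, ratio) with the highest ratio so far, instead of the heapq that the original method uses to track the n highest.
For example, in place of the main loop, something like: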
seqm = difflib.SequenceMatcher()
for i in arrayC:
    origC, cleanC, caccount, ccode = i
    seqm.set_seq2(cleanC)
    bestRatio = 0
    for j in arrayA:
        origA, cleanA = j[:2]
        seqm.set_seq1(cleanA)
        # quick upper bounds first; only compute ratio() if they can beat the best so far
        if (seqm.real_quick_ratio() >= bestRatio and
                seqm.quick_ratio() >= bestRatio):
            r = seqm.ratio()
            if r >= bestRatio:
                bestRatio = r
                bestA = j
    if bestRatio >= 0.75: # the cutoff from the original get_close_matches() call
        origA, cleanA, aaccount, acode = bestA
        filteredratio = bestRatio
        strfilteredratio = '%.2f' % (filteredratio*100)
        seqm.set_seqs( origC, origA )
        unfilteredratio = seqm.ratio()
        strunfilteredratio = '%.2f' % (unfilteredratio*100)
        averageratio = (filteredratio+unfilteredratio)/2
        straverageratio = '%.2f' % (averageratio*100)
        row = [origC.rstrip(),origA.rstrip(),cleanC.rstrip(),cleanA.rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
    else:
        row = ["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","0.00","NULL","NULL"]
    Match75.writerow(row)