Passing a list of lists to a function

Asked: 2011-05-04 14:44:13

Tags: python

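I have a multidimensional array that I am trying to feed into difflib.get_close_matches().

My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the ORIGINAL string with common words filtered out.

I currently create a new array containing only the FILTERED words and pass that to difflib.get_close_matches(). Then I try to match the result from difflib back to array[(ORIGINAL, FILTERED)]. My problem is that I frequently have two or more FILTERED words that are identical, so they cannot be matched back reliably with this method.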
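To make the collision concrete, here is a minimal sketch with made-up data (the names and the `records` list are hypothetical, not taken from the input files). It shows why a reverse lookup from a FILTERED string to its ORIGINAL is ambiguous, and one possible workaround: grouping the records by their FILTERED value.

```python
from collections import defaultdict

# Hypothetical (ORIGINAL, FILTERED) pairs; two originals share one filtered form.
records = [
    ("Acme University", "acme"),
    ("Acme Univ.", "acme"),
    ("Bolt Company", "bolt"),
]

# A plain dict keeps only the last ORIGINAL seen for each FILTERED key.
lossy = {filtered: original for original, filtered in records}

# Grouping keeps every ORIGINAL that maps to the same FILTERED string.
by_filtered = defaultdict(list)
for original, filtered in records:
    by_filtered[filtered].append(original)

print(lossy["acme"])
print(by_filtered["acme"])
```

Grouping does not decide which of the colliding originals is the right one; it only makes sure none of them is silently dropped.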

Is there a way I can feed the whole array[(ORIGINAL, FILTERED)] to difflib, but have it look only at the FILTERED part while returning the whole [(ORIGINAL, FILTERED)] pair?

Thanks in advance!

import time
import csv
import difflib
import sys
import os.path
import datetime

### Filters out common words in an attempt to get better results ###
def ignoredWords (word):
    filtered = word.lower()
    #Common Full Words
## Majority of filters were edited out
    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ","")
    #Special Characters
    if "  " in filtered: #Two White Spaces
        filtered = filtered.replace("  "," ")
    if "-" in filtered:
        filtered = filtered.replace("-"," ")
    if "\'" in filtered:
        filtered = filtered.replace("\'"," ")
    if " & " in filtered:
        filtered = filtered.replace(" &","")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"","")
    if "\")" in filtered:
        filtered = filtered.replace("\")","")
    if "\t" in filtered:
        filtered = filtered.replace("\t"," ")
    return filtered

### Takes in a list of rows, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray (rows):
    array = []
    for item in rows:  # each row is [account number, code, name]
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0],item[1])
        array.append(entry)
    return array

def main(argv):
    if(len(argv) < 3):
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif (not os.path.isfile(argv[1])):
        print "%s is not found" %(argv[1])
        sys.exit(2)
    elif (not os.path.isfile(argv[2])):
        print "%s is not found" %(argv[2])
        sys.exit(2)
    #Recode File ----- Not yet implemented
#       if(len(argv) == 4):
#       if(not os.path.isfile(argv[3])):
#           print "%s is not found" %(argv[3])
#           sys.exit(2)
#           
#       recode = open(argv[1], 'r')
#       try:
#           setRecode = c.readlines()
#       finally:
#           recode.close()
#           setRecode.sort()
#           print setRecode[0]
    #Measure execution time
    t0 = time.time()

    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()

    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()

    #Put Set A and Set C into their own 2-dimensional arrays: array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])

    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item,cleanListA,1,0.75)

            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None,item,match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None,origC,origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)

                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
            else:
                #No close match was found: match is empty, so there is no match[0] or ratio to report
                row = ["NULL","NULL",item.rstrip(),"NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"]
                Match75.writerow(row)

    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()

    print "%s seconds" % (time.time()-t0)

if __name__ == "__main__":
    main(argv=sys.argv)

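What I'm trying to accomplish:

  1. Read the input files.
  2. Filter common words out of the names so that the fuzzy match ('difflib.get_close_matches()') returns more accurate results.
  3. Compare the names in FileA with the names in FileB to find which one is the most likely match.
  4. Print out the original (unfiltered) names and the matching percentage.

Why this is difficult

The naming conventions used in the two input files differ significantly. Some names are partly abbreviated (e.g. File A: Acme Company; File B: Acme Co). Because the naming conventions are inconsistent, I cannot do 'FileA.intersect(FileB)', which would otherwise have been the ideal approach.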
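As a small illustration of that difficulty, the sketch below uses two made-up names modelled on the "Acme" example (they are assumptions, not rows from the real files). An exact set intersection finds nothing, while difflib.SequenceMatcher still reports a sizeable similarity.

```python
import difflib

# Illustrative names only, modelled on the "Acme" example in the question.
file_a = {"acme company"}
file_b = {"acme co"}

# Exact set intersection finds nothing because the strings differ.
exact = file_a & file_b

# SequenceMatcher still reports substantial similarity between the two names.
ratio = difflib.SequenceMatcher(None, "acme company", "acme co").ratio()

print(exact, round(ratio, 2))
```

Whether a pair like this clears a given cutoff depends on the strings involved, which is part of the motivation for filtering out common words before comparing.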

Where the modification should be made

    for item in cleanListC:
        match = difflib.get_close_matches(item,cleanListA,1,0.75)
    

cleanListA is created by:

    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])
    

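This loses the (ORIGINAL, FILTERED) pairing.

End goal

I would like to feed arrayA into difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. When determining close matches, difflib.get_close_matches() would look only at the FILTERED part of each pair, but it would return the whole pair.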
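difflib.get_close_matches() itself takes no key argument, so the behaviour described here needs a small wrapper. The following is a minimal sketch of one way to get it; `get_close_matches_by_key`, its parameters, and the sample `pairs` are hypothetical names introduced for illustration and are not part of difflib.

```python
import difflib
import heapq


def get_close_matches_by_key(word, records, key, n=3, cutoff=0.6):
    """Return up to n (score, record) pairs whose key(record) is close to word.

    Hypothetical helper modelled on difflib.get_close_matches(); not part of difflib.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for index, record in enumerate(records):
        matcher.set_seq1(key(record))
        # Cheap upper bounds first, then the exact ratio.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                scored.append((score, index, record))
    # Rank by score; on ties prefer the earlier record. The records
    # themselves are never compared with each other.
    best = heapq.nlargest(n, scored, key=lambda t: (t[0], -t[1]))
    return [(score, record) for score, _, record in best]


# Made-up (ORIGINAL, FILTERED) pairs with a duplicated FILTERED value.
pairs = [("Acme University", "acme"), ("Acme Univ.", "acme"), ("Bolt Company", "bolt")]
matches = get_close_matches_by_key("acme", pairs, key=lambda p: p[1], n=2, cutoff=0.75)
print(matches)
```

Because whole records are returned, two entries with the same FILTERED string stay distinguishable by their ORIGINAL part.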

1 answer:

Answer 0 (score: 0)

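Since you are already using SequenceMatcher directly to get the match ratio, the most straightforward change is probably to perform the get_close_matches operation yourself.

Compare the source of get_close_matches() [e.g. http://svn.python.org/view/python/tags/r271/Lib/difflib.py?revision=86833&view=markup, around line 737]. It returns a list of the n sequences with the highest ratios. Since you only want the single best match, you can keep track of the (ORIGINAL, FILTERED, ratio) with the highest ratio seen so far, instead of the heapq that the original method uses to track the n highest.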
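The library source tests candidates with real_quick_ratio() and quick_ratio() before calling ratio(), because both are documented as cheaper upper bounds on ratio(). The short sketch below, using two arbitrary sample strings chosen for illustration, shows the three values side by side.

```python
import difflib

matcher = difflib.SequenceMatcher(None, "acme widgets", "acme gadgets")

upper_fast = matcher.real_quick_ratio()  # based only on the two lengths
upper = matcher.quick_ratio()            # based on shared character counts
exact = matcher.ratio()                  # full matching-block computation

print(upper_fast, upper, exact)
```

If either bound already falls below the threshold, the exact ratio cannot reach it, so the expensive computation can be skipped for that candidate.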

For example, in place of the main loop, something like:

seqm = difflib.SequenceMatcher()

for i in arrayC:
  origC, cleanC, caccount, ccode = i
  seqm.set_seq2(cleanC)

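  bestRatio = 0
  bestA = None  # best arrayA entry seen so far for this arrayC entry

  for j in arrayA:
    origA, cleanA = j[:2]
    seqm.set_seq1(cleanA)

    # cheap upper bounds first; compute the full ratio only if they pass
    if (seqm.real_quick_ratio() >= bestRatio and
        seqm.quick_ratio() >= bestRatio):
      r = seqm.ratio()
      if r >= bestRatio:
        bestRatio = r
        bestA = j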

  if bestRatio >= 0.75: # the cutoff from the original get_close_matches() call
    origA, cleanA, aaccount, acode = bestA

    filteredratio = bestRatio
    strfilteredratio = '%.2f' % (filteredratio*100)

    seqm.set_seqs( origC, origA )
    unfilteredratio = seqm.ratio()
    strunfilteredratio = '%.2f' % (unfilteredratio*100)

    averageratio = (filteredratio+unfilteredratio)/2
    straverageratio = '%.2f' % (averageratio*100)

    row = [origC.rstrip(),origA.rstrip(),cleanC.rstrip(),cleanA.rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
  else:
    row = ["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","0.00","NULL","NULL"]

  Match75.writerow(row)