Efficient fuzzy string matching

Asked: 2017-07-18 02:17:12

Tags: python string-matching fuzzy-logic fuzzywuzzy

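I am trying to find an efficient way to match phoneme strings (which contain some errors) against orthographic sentences using fuzzy logic. My problem is that the search space is so large that I seem to run out of memory, even on my 64 GB machine.

For example, I have this input: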

sil s r eh d m ae ch ih ng iy t ae ih n ae d l sil

and I want to get back this sentence (or at least something as close to it as possible):

thread matching yarn in tapestry needle
sil th r eh d m ae t ch iy ng y aa r n ih n t ae p ih s t r iy n iy d l sil

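Currently I am using fuzzywuzzy. I first try to match the whole string; if that fails, I take the best partially matching sentence chunks and run the script several times to build up larger chunks, so that I don't overload memory, and eventually arrive at best_sents from the partial matches (par_sents) that are fed back in. This is my code so far: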

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import sys

rec_in = sys.argv[1]
candidates = sys.argv[2]
best_sents = sys.argv[3]
par_sents = sys.argv[4]

fuzz_list=[]
fuzz_par_list=[]
sorted_list = []
f1 = open(rec_in, 'r')
f2 = open(candidates, 'r')
f3 = open(best_sents, 'a')
f4 = open(par_sents, 'a')
lines_1 = f1.readlines()
lines_2 = f2.readlines()

for line_1 in lines_1:
    line_1 = line_1.strip()
    for line_2 in lines_2:
        line_2 = line_2.strip()
        fuzz_full = fuzz.ratio(line_2, line_1)
        fuzz_list.append(str(fuzz_full) + "|" + str(line_2))  

#found = False
for i in range(0, len(fuzz_list)):
    string = fuzz_list[i]
    score = string.split('|')[0]
    phrase = string.split('|')[1]

    if score == "100":
        f3.write(phrase + '\n')
        found = True
        print("found best")
        print("exiting... ")
        sys.exit()
    else:
        for line_1 in lines_1:
            fuzz_par = fuzz.ratio(phrase, line_1)
            if fuzz_par > 0:
                fuzz_par_list.append(str(fuzz_par) + "|" + str(phrase))


# sort by the numeric score that precedes the '|' separator
for x in sorted(fuzz_par_list, key=lambda s: int(s.split('|')[0]), reverse=True):
    f4.write(x + '\n')

The partial matches that become my candidates for the next round look like this:

68|th r eh d m ae ch ih ng y aa r n ah s l iy p ih ng k ae p s l
68|th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
67|th r eh d m ae ch ih ng y aa r n
66|th r eh d m ae ch ih ng y aa r n ah s eh k ih n d t w ih ch t
64|th r eh d m ae ch ih ng y aa r n d aw n ih n b ih k ah r ih ng
64|th r eh d m ae ch ih ng y aa r n s ah ch ah d ih z ae s t ah r
63|th r eh d m ae ch ih ng y aa r n ey jh ih n s iy hh ay ah r d ah
60|th r eh d m ae ch ih ng y aa r n ih k s p r eh sh ih n ih n dh ah
60|ah s l iy p ih ng k ae p s l th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
60|s ah ch ah d ih z ae s t ah r th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
...

I was wondering whether anyone could help me make this more efficient. Is there a better package than fuzzywuzzy, or does my code need work?
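
One direction that might reduce memory use, given only as a rough sketch under assumptions: stream the candidates and keep just the k best scores, instead of appending every score to a list. In the sketch below the standard-library difflib.SequenceMatcher stands in for fuzz.ratio, the comparison is done on whole phoneme tokens rather than characters, and the names phoneme_score and top_matches are hypothetical helpers, not part of fuzzywuzzy.

```python
import heapq
from difflib import SequenceMatcher


def phoneme_score(a, b):
    """Similarity (0..100) between two space-separated phoneme strings.

    The strings are split into phoneme tokens, so 'th' and 't' count as
    different symbols instead of sharing a character.
    """
    matcher = SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return int(round(100 * matcher.ratio()))


def top_matches(query, candidates, k=10):
    """Return the k best (score, candidate) pairs, highest score first.

    `candidates` can be any iterable of lines, such as an open file, so
    the full list of scores is never held in memory: heapq.nlargest
    keeps only k items while it consumes the generator.
    """
    scored = (
        (phoneme_score(query, line.strip()), line.strip())
        for line in candidates
        if line.strip()
    )
    return heapq.nlargest(k, scored)
```

With the files from the script above, this could be called as top_matches(line_1, open(candidates), k=10) for each input line, writing out only the returned pairs. Whether token-level scoring ranks sentences better than character-level scoring on the real data is something that would have to be checked.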

0 answers:

No answers yet