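I am struggling to find an efficient way to fuzzy-match phoneme strings (which contain some errors) against orthographic sentences. My problem is that the search space is so large that I seem to run out of memory, even on my 64GB machine.
For example, I have this input: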
sil s r eh d m ae ch ih ng iy t ae ih n ae d l sil
and I want to get back this sentence (or at least something as close to it as possible):
thread matching yarn in tapestry needle
sil th r eh d m ae t ch iy ng y aa r n ih n t ae p ih s t r iy n iy d l sil
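To make "as close as possible" concrete, one option is to count edits in whole phonemes rather than in characters, so that "s" for "th" counts as a single substitution. The sketch below is a plain Levenshtein distance over space-separated tokens. The function name phoneme_edit_distance is hypothetical, and this is not the score fuzzywuzzy reports; it only illustrates the kind of comparison I am after.

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance counted in phoneme tokens, not characters.

    Only two rows of the dynamic-programming table are kept, so memory
    grows with the length of one string, not with the product of both.
    """
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))          # distance from empty prefix of x
    for i, px in enumerate(x, 1):
        cur = [i]                           # distance to empty prefix of y
        for j, py in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,                # delete px
                cur[j - 1] + 1,             # insert py
                prev[j - 1] + (px != py),   # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


recognized = "sil s r eh d m ae ch ih ng iy t ae ih n ae d l sil"
reference = ("sil th r eh d m ae t ch iy ng y aa r n ih n "
             "t ae p ih s t r iy n iy d l sil")
print(phoneme_edit_distance(recognized, reference))
```

A lower number means a closer match; dividing by the longer token count would give a length-normalized score if sentences of different lengths need to be ranked together.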
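At the moment I am using fuzzywuzzy. I first try for a whole match; if there isn't one, I take the best partial matches for chunks of the sentence and run the script repeatedly on them to build up larger chunks, so that I do not overload the memory. Eventually I arrive at my best_sents by way of the partial matches (par_sents). Here is my code so far: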
import sys

from fuzzywuzzy import fuzz

rec_in = sys.argv[1]      # recognized phoneme strings, one per line
candidates = sys.argv[2]  # candidate phoneme strings, one per line
best_sents = sys.argv[3]  # output file for perfect matches
par_sents = sys.argv[4]   # output file for partial matches

fuzz_list = []
fuzz_par_list = []


def convert(entry):
    # Sort key. The original helper is not shown in this post;
    # this version is assumed to return the integer score before '|'.
    return int(entry.split('|')[0])


with open(rec_in, 'r') as f1, open(candidates, 'r') as f2:
    lines_1 = [line.strip() for line in f1]
    lines_2 = [line.strip() for line in f2]

# Score every candidate against every recognized line.
for line_1 in lines_1:
    for line_2 in lines_2:
        fuzz_full = fuzz.ratio(line_2, line_1)
        fuzz_list.append(str(fuzz_full) + "|" + line_2)

with open(best_sents, 'a') as f3, open(par_sents, 'a') as f4:
    for entry in fuzz_list:
        score, phrase = entry.split('|', 1)
        if score == "100":
            # Perfect match: record it and stop.
            f3.write(phrase + '\n')
            print("found best")
            print("exiting... ")
            sys.exit()
        else:
            # Otherwise keep the phrase as a partial match.
            for line_1 in lines_1:
                fuzz_par = fuzz.ratio(phrase, line_1)
                if fuzz_par > 0:
                    fuzz_par_list.append(str(fuzz_par) + "|" + phrase)

    # Write the partial matches, best score first.
    for x in sorted(fuzz_par_list, key=convert, reverse=True):
        f4.write(x + '\n')
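The lists in the code above grow with the number of input lines times the number of candidates, which is probably where the memory goes. One variant I could try is to stream the candidates and keep only the k best per input line. The sketch below is untested on my real data and makes several assumptions: difflib.SequenceMatcher stands in for fuzz.ratio so that it needs only the standard library, the comparison is done on phoneme tokens instead of characters, and the names score and top_matches are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def score(a, b):
    """Similarity from 0 to 100 over phoneme tokens (stand-in for fuzz.ratio)."""
    matcher = SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return int(round(100 * matcher.ratio()))


def top_matches(query, candidates, k=10):
    """Return the k best (score, candidate) pairs, highest score first.

    `candidates` can be any iterable of lines, for example an open file,
    so the full candidate list never has to be held in memory.
    """
    scored = ((score(query, line.strip()), line.strip()) for line in candidates)
    # heapq.nlargest consumes the generator lazily and keeps only k items.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    import sys

    # Assumed usage: script.py <recognized file> <candidate file>
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as rec:
            for line in rec:
                with open(sys.argv[2]) as cand:
                    for s, phrase in top_matches(line.strip(), cand, k=10):
                        print("%d|%s" % (s, phrase))
```

Re-reading the candidate file for every input line trades speed for memory; if the candidates fit in memory as a plain list of strings, they can be loaded once and passed to top_matches directly.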
The partial matches that become my candidates for the next round look like this:
68|th r eh d m ae ch ih ng y aa r n ah s l iy p ih ng k ae p s l
68|th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
67|th r eh d m ae ch ih ng y aa r n
66|th r eh d m ae ch ih ng y aa r n ah s eh k ih n d t w ih ch t
64|th r eh d m ae ch ih ng y aa r n d aw n ih n b ih k ah r ih ng
64|th r eh d m ae ch ih ng y aa r n s ah ch ah d ih z ae s t ah r
63|th r eh d m ae ch ih ng y aa r n ey jh ih n s iy hh ay ah r d ah
60|th r eh d m ae ch ih ng y aa r n ih k s p r eh sh ih n ih n dh ah
60|ah s l iy p ih ng k ae p s l th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
60|s ah ch ah d ih z ae s t ah r th r eh d m ae ch ih ng y aa r n m ae ch ih ng y aa r n ih n
...
I was wondering if anyone could help me make this more efficient. Is there a better package than fuzzywuzzy for this, or does my code need work?