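I am trying to match a set of keywords against a set of text descriptions, and I want to use difflib to find out which keywords are highly relevant to a particular set of texts: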
    from difflib import SequenceMatcher
    import csv
    import re
    import operator

    def cleanhtml(raw_html):
        # Strip anything that looks like an HTML tag.
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext

    with open('awl.csv', encoding='utf-8') as f:
        csv_read = csv.reader(f)
        next(csv_read)  # skip the header row
        for line in csv_read:
            line[18] = cleanhtml(line[18])
            lines = line[18].split(" ")
            t = []
            for i in lines:
                i = cleanhtml(i)
                # Keep only lowercase alphanumeric characters.
                i = ((re.sub(r"[^a-zA-Z0-9]+", ' ', i)).lower()).replace(" ", "")
                s = SequenceMatcher(None, i, "girl")
                # print(s.ratio())
                t.append(s.ratio())
            t.sort()
            # Map each distinct ratio to the number of words that produced it.
            d = {x: t.count(x) for x in t}
            d.pop(0.0, None)
            # print(d)
            try:
                print(line[0] + " " + str(d[1]))
            except KeyError:
                print(line[0] + " Nothing Matched")
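Right now, for each text, the code above returns the frequency of words whose difflib ratio against the keyword is exactly 1. From those frequencies I will pick the maximum, and then do the same for the complete set of keywords. However, I don't want only the frequency of words with a difflib ratio of exactly 1.0; I want the frequency of words whose ratio falls in the range 0.8 to 1.0. Suggestions for a better matching algorithm are also very welcome.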
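
To make the goal concrete, here is a minimal sketch of the range-based count I am after. It is only an illustration: the helper name `count_fuzzy_matches`, its `low`/`high` parameters, and the sample sentence are made up for this example and are not part of my real script or data.

```python
from difflib import SequenceMatcher
import re

def count_fuzzy_matches(text, keyword, low=0.8, high=1.0):
    """Count the words in `text` whose ratio against `keyword` lies in [low, high]."""
    count = 0
    for word in text.split():
        # Same normalisation as in the script above: lowercase alphanumerics only.
        word = re.sub(r"[^a-zA-Z0-9]+", "", word).lower()
        if not word:
            continue
        ratio = SequenceMatcher(None, word, keyword).ratio()
        if low <= ratio <= high:
            count += 1
    return count

# Hypothetical sample text, not taken from awl.csv.
sample = "The girl met two girls and a boy."
print(count_fuzzy_matches(sample, "girl"))
```

In the script above this would presumably replace the `d[1]` lookup, with `line[18]` passed in as `text`. `difflib.get_close_matches`, which takes a `cutoff` argument, might also be a starting point, although it returns the matching words rather than a count.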