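I'm new to Python and I'm trying to use fuzzywuzzy for fuzzy matching. I believe the match score I get from the partial_ratio function is incorrect. Here is my exploratory code: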
>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50
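I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I get a match score of 100.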
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100
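The score seems to switch from 50 to 100 once the length of the first string drops to 199. Does anyone have any insight into what might be happening?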
Answer 0 (score: 1)
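This happens because when one of the strings is 200 characters or longer, an automatic junk heuristic gets turned on in Python's SequenceMatcher. With the heuristic active, characters that occur very often in the long string are marked as "popular" and treated as junk during matching, so the matching blocks that partial_ratio relies on can miss the substring. This code, which passes autojunk=False, should work for you: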
from difflib import SequenceMatcher


def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""
    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    # autojunk=False turns off the heuristic that is otherwise applied
    # when the second sequence has 200 or more items
    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1, 3, 3)
    #   best score === ratio("abcd", "Xbcd")
    scores = []
    for (short_start, long_start, _) in blocks:
        # shift the window left so it lines up with the start of `shorter`
        window_start = max(long_start - short_start, 0)
        long_substr = longer[window_start:window_start + len(shorter)]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    return max(scores) * 100.0
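
If you want to see the heuristic on its own, without fuzzywuzzy, here is a minimal sketch that uses only difflib. The helper name matched_chars and the synthetic strings are made up for this illustration; they are not part of fuzzywuzzy or of the question's data. It compares a short needle against a haystack just below and just above the 200-character threshold.

```python
from difflib import SequenceMatcher


def matched_chars(a, b, autojunk=True):
    """Total number of characters covered by the matching blocks of a and b.

    Hypothetical helper for this demo only.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=autojunk)
    return sum(block.size for block in matcher.get_matching_blocks())


needle = "ab"
short_haystack = "#" + "ab" * 99   # 199 characters: below the threshold
long_haystack = "#" + "ab" * 100   # 201 characters: heuristic is active

below_threshold = matched_chars(needle, short_haystack)
above_threshold = matched_chars(needle, long_haystack)
heuristic_off = matched_chars(needle, long_haystack, autojunk=False)

print(below_threshold)  # 2
print(above_threshold)  # 0
print(heuristic_off)    # 2
```

In the 201-character haystack both 'a' and 'b' occur far too often, so with the default autojunk=True they are dropped from the index and no matching block is found. The leading '#' is only there so the needle cannot line up with the very start of the haystack by accident.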