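I have a string in a format like this:
Example: '2009060712ab56c'
Suppose I want to compare it with another string and get a percentage that expresses how similar their formats are, for example:
result = format_similarity('2009060712ab56c', '20070908njndla56gjhk')
In this case, the result would be 80%.
Is there a way to do this?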
Answer 0: (score: 0)
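Your format consists of two distinct attributes, the overall length and the run of leading digits, and each is measured differently. How you combine them into an overall percentage similarity for the format is a business-logic question. For example, if the digits at the start are missing, so that the string no longer begins with a date, is it now completely different, or still similar? You can, however, obtain the measurements like this: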
import re


def determine_similarity(string, other):
    length_string = len(string)  # use len to get the number of characters in the string
    length_other = len(other)
    number_of_numbers_string = _determine_number_of_numbers(string)
    number_of_numbers_other = _determine_number_of_numbers(other)
    # <some logic here to create a metric of similarity>
    # <find the differences and divide them?>
    # Placeholder until that logic is decided: return the raw measurements.
    return (length_string, length_other), (number_of_numbers_string, number_of_numbers_other)


LEADING_NUMBERS = re.compile(
    r"^"      # anchor at start of string
    r"[0-9]"  # must be a digit
    r"+"      # one or more matches
)


def _determine_number_of_numbers(string):
    """
    Determine how many LEADING digits are in a string
    """
    match = LEADING_NUMBERS.search(string)
    if match is not None:
        length = len(match.group())  # number of digits is the length of the matched group
    else:
        length = 0  # no match means no digits
    # <You might want to check whether the digits constitute a date within a certain range or something like that>
    # <For example, take the first four digits and check whether the year is between 1980 and 2018>
    return length
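The combination step is left open above because it depends on your rules. As one possible sketch, not a definitive answer, the two measurements could each be turned into a min/max ratio and then mixed with a weight. The helper names `leading_digit_count` and `ratio`, the `digit_weight` parameter, and its default of 0.5 are all assumptions made for illustration.

```python
import re

LEADING_DIGITS = re.compile(r"^[0-9]+")


def leading_digit_count(text):
    """Return how many digits the string starts with (0 if none)."""
    match = LEADING_DIGITS.match(text)
    return len(match.group()) if match else 0


def ratio(a, b):
    """Similarity of two non-negative counts in [0, 1]; two zeros count as identical."""
    largest = max(a, b)
    return 1.0 if largest == 0 else min(a, b) / largest


def format_similarity(string, other, digit_weight=0.5):
    """Weighted mix of leading-digit and length similarity, as a percentage.

    digit_weight is an assumed tuning knob, not something the question defines.
    """
    digit_score = ratio(leading_digit_count(string), leading_digit_count(other))
    length_score = ratio(len(string), len(other))
    return 100.0 * (digit_weight * digit_score + (1.0 - digit_weight) * length_score)


print(format_similarity('2009060712ab56c', '20070908njndla56gjhk'))
```

With equal weights, the two strings from the question score about 77.5 (10 vs. 8 leading digits, 15 vs. 20 characters) rather than exactly 80%, so the weights or the measured attributes would need adjusting to match whatever rule produced that figure.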
Answer 1: (score: 0)
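JETM pointed out in a comment that https://pypi.org/project/python-Levenshtein/ may be a good resource for comparing "closeness", for example the edit distance between two strings (how many changes must be made to one string for it to match the other).
You could write your own "edit distance" implementation to match your custom rules, for example: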
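A minimal sketch of such a custom edit distance, written in plain Python rather than with the library above. It assumes one hypothetical rule: replacing a character with another of the same class (digit for digit, letter for letter) is cheap, while changing class costs as much as an insertion or deletion. The cost of 0.25 and the function names are illustrative choices only.

```python
def char_class(ch):
    """Map a character to a coarse format class: digit, letter, or other."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    return "o"


def substitution_cost(a, b):
    """Assumed rule: same character is free, same class is cheap, different class is full cost."""
    if a == b:
        return 0.0
    return 0.25 if char_class(a) == char_class(b) else 1.0


def custom_edit_distance(s, t):
    """Levenshtein-style dynamic programme using the custom substitution cost."""
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [float(i)]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1.0,                          # delete a from s
                current[j - 1] + 1.0,                       # insert b into s
                previous[j - 1] + substitution_cost(a, b),  # substitute a with b
            ))
        previous = current
    return previous[-1]


def edit_distance_similarity(s, t):
    """Convert the distance to a percentage by dividing by the longer length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - custom_edit_distance(s, t) / longest)
```

Because no single operation costs more than 1, the distance never exceeds the longer string's length, so the percentage stays between 0 and 100. Changing `substitution_cost` is where your own format rules would go.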