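My problem is a little different from plain word similarity. The question is: what algorithm can be used to compute the similarity between an email address and a person's name?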
For example:
Mail: Abd_tml_1132@gmail.com
Name: Abdullah temel
Levenshtein, Hamming distance: 11
Jaro distance: 0.52
Even so, the email address most likely does belong to that name.
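Answer 0 (score: 1)
There is no package that does this directly, but the following approach should solve your problem:
Split the email ID into a list of words: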
a = 'Abd_tml_1132@gmail.com'
rest = a.split('@', 1)[0]  # keep only the part before the '@'
result = ''.join(c for c in rest if not c.isdigit())  # drop digits, since names do not contain them
list_of_email_words = result.split('_')  # split into words; change the separator (e.g. to '.') to suit the email ID
list_of_email_words = list(filter(None, list_of_email_words))  # remove empty strings
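The tokenizing steps above can also be wrapped in one helper. This is only a minimal sketch and not part of the original answer: the function name `email_tokens` and the set of separators (`.`, `_`, `-`, `+`) are assumptions chosen for illustration.

```python
import re

def email_tokens(address):
    """Split the local part of an email address into candidate name words."""
    local_part = address.split("@", 1)[0]        # text before the first '@'
    no_digits = re.sub(r"\d+", "", local_part)   # names are assumed to contain no digits
    # Split on a few common separators (assumed set) and drop empty pieces.
    return [token for token in re.split(r"[._\-+]+", no_digits) if token]

print(email_tokens("Abd_tml_1132@gmail.com"))
```

Splitting on several separators at once avoids having to know in advance whether an address uses `_` or `.` between words.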
Split the name into a list:
b = 'Abdullah temel'
list_of_name_words = b.split(' ')
Apply fuzzy matching to the two lists:
from fuzzywuzzy import fuzz  # third-party package

score = []
for email_word in list_of_email_words:
    for name_word in list_of_name_words:
        # compare every email word with every name word
        d = fuzz.partial_ratio(email_word, name_word)
        score.append(d)
Now you only need to check whether any element of score exceeds a threshold that you define. For example:
threshold = 70
if any(x > threshold for x in score):
    print("matched")
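For readers who cannot install fuzzywuzzy, the same "any pair above a threshold" idea can be sketched with the standard library. Note the assumptions: `partial_score` is a hypothetical, simplified stand-in built on `difflib.SequenceMatcher` (it slides the shorter word over the longer one and keeps the best window), so its scores will not necessarily equal those of `fuzz.partial_ratio`. Lowercasing both words is also an addition not present in the answer.

```python
from difflib import SequenceMatcher

def partial_score(word_a, word_b):
    """Best 0-100 similarity between the shorter word and any
    equally long window of the longer word (simplified stand-in)."""
    shorter, longer = sorted((word_a.lower(), word_b.lower()), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)

def is_probable_match(email_words, name_words, threshold=70):
    """True if any email word / name word pair scores above the threshold."""
    return any(
        partial_score(e, n) > threshold
        for e in email_words
        for n in name_words
    )

print(is_probable_match(["Abd", "tml"], ["Abdullah", "temel"]))
```

The threshold of 70 is taken from the answer; in practice it would probably need tuning on real address and name pairs.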
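Answer 1 (score: 0)
FuzzyWuzzy can give you the solution you need. First use a regular expression to remove the "@" and the domain name from the email string. After that, you will have two strings to compare: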
from fuzzywuzzy import fuzz as fz
str1 = "Abd_tml_1132"
str2 = "Abdullah temel"
count_ratio = fz.ratio(str1,str2)
print(count_ratio)
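The answer mentions a regular expression for removing the "@" and the domain but does not show one. A minimal sketch of that step might look like the following; the helper name `strip_domain` is hypothetical, and the pattern simply drops everything from the first "@" onward.

```python
import re

def strip_domain(address):
    """Remove the '@' and everything after it from an email address."""
    return re.sub(r"@.*$", "", address)

print(strip_domain("Abd_tml_1132@gmail.com"))
```

The result is the value the answer assigns to `str1` before calling `fz.ratio`.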