import re

def email_matcher(emails_file, names_file):
    matches = {}
    with open(names_file, 'r') as names:
        for i in names:
            with open(emails_file, 'r') as emails:
                first = i[:(i.index(' '))]
                pattern2 = i[0]
                last = i[::-1].strip()
                last = last[0:(last.index(' '))][::-1]
                for j in emails:
                    if re.search(first, j):
                        matches[i] = j
                    elif re.search(last, j):
                        matches[i] = j
                    else:
                        matches[i] = 'nothing found'
    return matches
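This is my code so far. I know it doesn't work: everything comes back as no match found. The goal is to look through all the emails and find the email that best matches each name. I don't know how to build the regex patterns; I tried reading the documentation but I'm still not sure what to do. What I'd like to do is check different things, in order from most to least accurate:
1 - check whether the first, middle, and last name are all in the email
2 - check whether the first and last name are in the email
3 - check whether the first initial plus last name is in the email
4 - check whether the first name plus last initial is in the email
5 - check whether the first name is in the email
6 - check whether the last name is in the email
Would I have to run 6 different regex searches over every email, or is there a way to do a single search per email and see whether it hits any of the groups in the pattern?
Right now my code only searches for the first name and the last name, and it isn't right at all.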
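To illustrate the "one search per email" idea: a single pattern can hold several named alternatives, and `match.lastgroup` reports which one hit. This is only a rough sketch. The group names, the optional `.` or `_` separator, and the choice to match the whole part before the `@` are assumptions, not something taken from the real data.

```python
import re

def build_pattern(first, last):
    """Build one regex whose named alternatives are tried in priority order."""
    f = re.escape(first.lower())
    l = re.escape(last.lower())
    fi = re.escape(first[0].lower())
    # Alternation is tried left to right, so earlier (more specific) groups win.
    return re.compile(
        rf"^(?:(?P<first_last>{f}[._]?{l})"
        rf"|(?P<initial_last>{fi}[._]?{l})"
        rf"|(?P<first_only>{f})"
        rf"|(?P<last_only>{l}))$"
    )

def classify(name, email):
    """Return the name of the alternative that matched the local part, or None."""
    parts = name.split()
    local = email.split("@", 1)[0].lower()
    m = build_pattern(parts[0], parts[-1]).match(local)
    return m.lastgroup if m else None

print(classify("Mary Williams", "mary.williams@gmail.com"))
print(classify("Jacob Jessica Andrews", "jandrews@hotmail.com"))
```

Lower-casing both sides matters here: a case-sensitive search for `Mary` never finds `mary`.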
Added emails:
Mary Williams - mary.williams@gmail.com
Charles Deanna West - charles.west@yahoo.com
Jacob Jessica Andrews - jandrews@hotmail.com
Javier Daisy Sparks - javier.sparks@gmail.com
Paula A. Graham - graham@gmail.com (couldn't find a better match; no email contains paula. There are also multiple Paulas and Grahams in the list)
Jasmine Sherman - jherman@hotmail.com
Matthew Foster - matthew.foster@gmail.com
Ernest Michael Bowman - ernest.bowman@gmail.com
Chad Hernandez - hernandez@gmail.com
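So I just went through all of these, and the patterns look like firstinitiallastname, firstname.lastname, or lastname@email. There are a huge number of names and even more emails, though, so I don't know the general case. But I feel that as long as I look for firstname.lastname@email first, then firstinitiallastname@email, then middleinitiallastname@email, with just name@email as the worst case, that would be enough?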
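The priority order described above could be sketched without regex at all, by generating the candidate local parts for a name from most to least specific and taking the first one that exists. The function names and labels below are made up for illustration, and exact equality with the part before the `@` is an assumption.

```python
def candidate_locals(name):
    """Yield (label, local_part) guesses for a name, most specific first."""
    parts = [p.strip(".").lower() for p in name.split() if p.strip(".")]
    first, last = parts[0], parts[-1]
    middles = parts[1:-1]
    yield "first.last", f"{first}.{last}"
    yield "firstinitial+last", f"{first[0]}{last}"
    for m in middles:
        yield "middleinitial+last", f"{m[0]}{last}"
    yield "last", last

def best_email(name, emails):
    """Return (label, email) for the first candidate found, or None."""
    # Note: if two emails share a local part, the later one overwrites the earlier.
    by_local = {e.split("@", 1)[0].lower(): e for e in emails}
    for label, local in candidate_locals(name):
        if local in by_local:
            return label, by_local[local]
    return None

emails = ["charles.west@yahoo.com", "jandrews@hotmail.com", "graham@gmail.com"]
print(best_email("Charles Deanna West", emails))
print(best_email("Paula A. Graham", emails))
```

The returned label shows how weak a match is. A bare `last` hit is ambiguous when several people share a surname, which is the situation with the multiple Grahams.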
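Answer 0: (score: 1)
Here is an approach that doesn't need regex and instead uses a fuzzy matching technique called Levenshtein distance.
First, split each email from its domain so that the @something.com part sits in a separate column.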
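A minimal sketch of that split, assuming each address contains at least one `@` (the helper name `split_email` is just for illustration):

```python
def split_email(address):
    """Split 'local@domain' into (local, domain) at the last '@'."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    return local, domain

rows = [split_email(e) for e in ["mary.williams@gmail.com", "jandrews@hotmail.com"]]
print(rows)
```

Only the local part is then compared against the names, so the domain doesn't distort the similarity score.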
Next, it sounds like what you are describing is a fuzzy matching algorithm called Levenshtein distance. You can use a module designed for it, or write your own:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)

    # Populate the first column and first row with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,       # Cost of deletions
                                     distance[row][col-1] + 1,       # Cost of insertions
                                     distance[row-1][col-1] + cost)  # Cost of substitutions

    # The bottom-right cell holds the result for the full strings
    if ratio_calc:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s) + len(t)) - distance[-1][-1]) / (len(s) + len(t))
        return Ratio
    else:
        # print(distance)  # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[-1][-1])
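Now you get a numeric value that tells you how similar the two strings are. You still have to decide what value counts as acceptable for you.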
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
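To show what "deciding on an acceptable value" might look like for the names and emails, here is a self-contained sketch that picks the closest local part for a name and rejects it when the score falls below a cutoff. It uses a plain-Python edit distance with unit costs rather than the NumPy function above. The 0.6 threshold and the `1 - distance / longest` scaling are arbitrary assumptions to tune against real data.

```python
def levenshtein(a, b):
    """Classic edit distance with unit costs, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Scale the distance into 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def closest(name, local_parts, threshold=0.6):
    """Return the most similar local part, or None if it scores below threshold."""
    key = "".join(name.lower().split())
    score, best = max(((similarity(key, lp.lower()), lp) for lp in local_parts),
                      default=(0.0, None))
    return best if score >= threshold else None

print(closest("Mary Williams", ["mary.williams", "jandrews", "graham"]))
print(closest("Jasmine Sherman", ["graham", "jandrews"]))
```

Short forms such as `jandrews` score poorly against a full name, so a fuzzy score like this probably works best as a fallback after the exact pattern checks.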
Besides Levenshtein there are other similarity algorithms. You could try Jaro-Winkler or trigram matching.
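As a rough idea of the trigram option: break each string into overlapping 3-character pieces and measure how much the two sets overlap. The space padding and the Jaccard ratio used here are assumptions for the sketch; library implementations may score differently.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text, padded with spaces."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity("sherman", "jherman"))
print(trigram_similarity("graham", "williams"))
```

Unlike edit distance, this score depends on shared chunks rather than position, so a single wrong leading character costs several trigrams at once.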
I got the Levenshtein code above from https://www.datacamp.com/community/tutorials/fuzzy-string-python
Answer 1: (score: 0)
OK, I found a pattern that works for everything.