Using regular expressions

Time: 2016-10-27 21:07:48

Tags: python regex email

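import re

def email_matcher(emails_file, names_file):
    matches = {}
    with open(names_file, 'r') as names:
        for i in names:
            # Re-open the emails file so it is scanned from the start for every name
            with open(emails_file, 'r') as emails:
                first = i[:(i.index(' '))]              # text before the first space
                pattern2 = i[0]                         # first initial (not used yet)
                last = i[::-1].strip()
                last = last[0:(last.index(' '))][::-1]  # text after the last space
                for j in emails:
                    if re.search(first, j):
                        matches[i] = j
                    elif re.search(last, j):
                        matches[i] = j
                    else:
                        # Runs for every non-matching line, so it overwrites an earlier match
                        matches[i] = 'nothing found'
    return matches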

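This is my code so far. I know it doesn't work: what I get is that no matches are found. The goal is to look through all the emails and find the email that best matches each name. I don't know how to build the regex patterns; I tried reading the documentation but I'm not sure what to do. What I want is to check different things in order, from most accurate to least accurate: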

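1 - check whether the first name, last name and middle name are in the email
2 - check whether the first name and last name are in the email
3 - check whether the first initial and the last name are there
4 - check whether the first name and the last initial are there
5 - check whether the first name is there
6 - check whether the last name is there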

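Do I need 6 different regex searches over every email, or is there a way to run a single search on each email and see whether it hits any of the groups in the pattern?

Right now my code only searches for the first name and the last name, and it isn't right at all.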

Adding the emails:

Mary Williams - mary.williams@gmail.com

Charles Deanna West - charles.west@yahoo.com

Jacob Jessica Andrews - jandrews@hotmail.com

Javier Daisy Sparks - javier.sparks@gmail.com

Paula A. Graham - graham@gmail.com (no best match can be found: no address contains paula, and the list has several Paulas and Grahams)

Jasmine Sherman - jherman@hotmail.com

Matthew Foster - matthew.foster@gmail.com

Ernest Michael Bowman - ernest.bowman@gmail.com

Chad Hernandez - hernandez@gmail.com

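So I just went through all of these, and it looks like the patterns are firstinitiallastname, firstname.lastname, or lastname@email. There are a lot of names and even more emails, though, so I don't know the general case. But I figure that if I look for firstname.lastname@email, then firstinitiallastname@email, then middleinitiallastname@email, with just lastname@email as the worst case, that should be enough?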

2 Answers:

Answer 0 (score: 1)

Here is an approach that does not use regular expressions but instead uses a fuzzy-matching technique called Levenshtein distance.

First, separate each email from its domain, so that the @something.com part sits in a different column.
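A minimal sketch of that split, assuming each address contains a single `@`. The helper name `split_email` is hypothetical and not part of the original answer.

```python
def split_email(address):
    """Return (local_part, domain) for a single email address.

    Assumes the address contains an '@'; without one, the local part
    comes back empty and the whole string is returned as the domain.
    """
    local, _, domain = address.strip().rpartition("@")
    return local.lower(), domain.lower()

emails = ["mary.williams@gmail.com", "jandrews@hotmail.com"]
# Keep only the part before the '@' for comparison against names
local_parts = [split_email(e)[0] for e in emails]
```

Only the local part (`mary.williams`, `jandrews`) is then compared against the names, so the domain does not distort the similarity score.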

Next, it sounds like you are describing a fuzzy-matching algorithm called Levenshtein distance. You can use a module designed for this, or write a custom one:

import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Fill the first column and first row with prefix lengths: the cost of
    # deleting the first i characters of s, or inserting the first k of t
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    # The bottom-right cell holds the total edit cost between s and t.
    # Indexing it by size (not by the loop variables) also works for empty strings.
    edits = distance[rows-1][cols-1]
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        total = len(s) + len(t)
        Ratio = (total - edits) / total if total else 1.0
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string s to string t
        return "The strings are {} edits away".format(edits)

Now you get a numeric value that says how similar the two strings are. You still need to decide what value is acceptable to you.

Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
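Applied to the question, the ratio can be used to pick the closest email for each name. The following is only a sketch under stated assumptions: `lev_ratio` is a dependency-free restatement of the ratio calculation (substitution cost 2), `best_email` is a hypothetical helper, the `first.last` target form is a guess based on the sample data, and the `0.6` threshold is an arbitrary starting point that would need tuning.

```python
def lev_ratio(s, t):
    """Levenshtein ratio with substitution cost 2, kept to two rows of the matrix."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 2
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    total = len(s) + len(t)
    return (total - prev[-1]) / total if total else 1.0

def best_email(name, emails, threshold=0.6):
    """Return the email whose local part is most similar to the name, or None."""
    parts = name.lower().replace(".", "").split()
    candidate = parts[0] + "." + parts[-1]  # assumed target form: first.last
    scored = [(lev_ratio(candidate, e.split("@")[0].lower()), e) for e in emails]
    score, email = max(scored)
    # Below the (assumed) threshold, report no match instead of a weak guess
    return email if score >= threshold else None
```

With this sketch, `Jacob Jessica Andrews` is compared as `jacob.andrews`, which is still closer to `jandrews` than to the other local parts, while a name with no similar address returns `None`.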

Besides Levenshtein, there are other similarity algorithms. You could try Jaro-Winkler or trigrams.
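For comparison, a trigram similarity can be sketched in a few lines. This is an illustration rather than a reference implementation: the function names are hypothetical, the padding (two spaces in front, one behind) is just one common convention, and the score here is the Jaccard overlap of the two trigram sets.

```python
def trigrams(text):
    """Set of 3-character substrings of text, padded with spaces at both ends."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    # The padding guarantees both sets are non-empty, so the union is never empty
    return len(ta & tb) / len(ta | tb)
```

Identical strings score 1.0, strings with no shared trigram score 0.0, and near misses such as `jsherman` versus `jherman` land in between.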

I got this code from https://www.datacamp.com/community/tutorials/fuzzy-string-python

Answer 1 (score: 0)

OK, I found a pattern that works for everything.
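The answer does not show the pattern itself. As a rough sketch only, and not the answerer's actual pattern, the tiers described in the question could be tried from most to least specific like this. The function names, tier labels, and the optional `.` or `_` separator are all assumptions, and each name is assumed to contain at least one non-empty token.

```python
import re

def tiered_patterns(full_name):
    """Build (label, compiled regex) pairs, most specific tier first."""
    parts = [p.strip(".").lower() for p in full_name.split()]
    first, last = parts[0], parts[-1]
    esc = re.escape
    tiers = [
        ("first.last", esc(first) + r"[._]?" + esc(last)),
        ("first-initial+last", esc(first[0]) + esc(last)),
    ]
    if len(parts) > 2:  # a middle name or middle initial is present
        tiers.append(("middle-initial+last", esc(parts[1][0]) + esc(last)))
    tiers.append(("last", esc(last)))
    return [(label, re.compile(pattern)) for label, pattern in tiers]

def match_email(full_name, emails):
    """Return (email, tier label) for the first tier that matches, else None."""
    for label, pattern in tiered_patterns(full_name):
        for email in emails:
            local = email.split("@")[0].lower()
            # fullmatch: the whole local part must fit the tier's pattern
            if pattern.fullmatch(local):
                return email, label
    return None
```

On the sample data this would pair `Mary Williams` with `mary.williams@gmail.com` at the first tier and `Paula A. Graham` with `graham@gmail.com` only at the last one. An address such as `jherman@hotmail.com` for `Jasmine Sherman` matches no tier, which is the kind of case where a fuzzy score may help instead.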