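How to exclude email addresses from a specific domain and extract the rest in a Pythonic way

Time: 2019-03-13 19:20:10

Tags: python regex list

I have a list of email addresses, some of which come from a relevant domain while the others come from spam/irrelevant email domains. I want to capture both, but in separate lists. I know where the relevant emails come from (always the same domain, @gmail.com), but the spam comes from various domains, and all of those need to be captured as well.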

    # Extract all email ids from a JSON file
    import re
    import json

    with open("test.json", 'r') as fp:
        json_decode = json.loads(fp.read())

        line = str(json_decode)

        match = re.findall(r'[\w\.-]+@[\w.-]+', line)
        l = len(match)
        print(match)

        for i in match:
            # Attempt to get the domain; this line fails because
            # match is a list and i is a string, not an index
            domain = match.split('@')[i]

    # Output of print(match):
    # ['image001.png@01D36CD8.2A2219D0', 'arealjcl@countable.us', 'taylor.l.ingram@gmail.com']

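The first two are spam and the third is a legitimate email, so they must go in different lists. Should I split at @ to determine the domain, or exclude everything that is not @gmail.com and dump it into another list?

3 answers:

Answer 0: (score: 1)

I suggest using the endswith() function. Here is how to use it: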

legit = []
spam = []

# We iterate through the list of matches
for email in match:

    # This checks if the email ends with @gmail.com.
    # If it returns True, that means it is a good email.
    # But, if it returns False, then it means that the email
    # is spam.
    email_status = email.endswith("@gmail.com")


    if not email_status:
        spam.append(email)

    else:
        legit.append(email)

Edit: changed the code so that it answers your question correctly.
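
As a variation on the loop above, here is a minimal sketch that relies on the fact that `str.endswith()` also accepts a tuple of suffixes, and lowercases each address first so that something like `@Gmail.com` would still count. The `match` list is copied from the output in the question, and the `legit_suffixes` name is only an assumption for illustration:

```python
# Sample data copied from the output shown in the question.
match = [
    'image001.png@01D36CD8.2A2219D0',
    'arealjcl@countable.us',
    'taylor.l.ingram@gmail.com',
]

# Assumption: the suffixes you consider legitimate; add more to the tuple if needed.
legit_suffixes = ('@gmail.com',)

# endswith() accepts a tuple and returns True if any suffix matches.
legit = [email for email in match if email.lower().endswith(legit_suffixes)]
spam = [email for email in match if not email.lower().endswith(legit_suffixes)]

print(legit)
print(spam)
```

Including the `@` in the suffix matters here: a bare `gmail.com` suffix would also accept an address at a domain such as `notgmail.com`.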

Answer 1: (score: 0)

When you split an email address on '@', you get a list of two items:

In [3]: 'image001.png@01D36CD8.2A2219D0'.split('@')
Out[3]: ['image001.png', '01D36CD8.2A2219D0']

If you want to check the domain, index the second item of the result:

In [4]: q = 'image001.png@01D36CD8.2A2219D0'.split('@')

In [5]: q[1]
Out[5]: '01D36CD8.2A2219D0'

So your for loop would look more like this:

In [9]: for thing in match:
   ...:     domain = thing.split('@')[1]
   ...:     print(domain)
   ...:     
01D36CD8.2A2219D0
countable.us
gmail.com
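
Building on the split shown above, the following is a rough sketch of how the domains could be used to separate the two lists. It uses `str.rpartition('@')` instead of `split('@')[1]`, which is my own substitution so that only the last `@` is treated as the separator. The `by_domain` name is hypothetical and the sample list is copied from the question:

```python
from collections import defaultdict

# Sample data copied from the output shown in the question.
match = [
    'image001.png@01D36CD8.2A2219D0',
    'arealjcl@countable.us',
    'taylor.l.ingram@gmail.com',
]

# Group every address under its lowercased domain.
by_domain = defaultdict(list)
for address in match:
    # rpartition returns (text before the last '@', '@', text after it)
    local_part, _, domain = address.rpartition('@')
    by_domain[domain.lower()].append(address)

# Pull out the relevant domain; everything left over is treated as spam.
legit = by_domain.pop('gmail.com', [])
spam = [address for addresses in by_domain.values() for address in addresses]

print(legit)
print(spam)
```

Grouping by domain first also makes it easy to inspect which spam domains appear before discarding them.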

Answer 2: (score: 0)

You can split them into two lists according to the relevant domains you define:

    # extract all email ids from a json file
    import re
    import json

    # No leading '@' here: split('@')[1] returns only the domain part.
    relevant_domains = ['gmail.com'] # you can add more

    with open("test.json", 'r') as fp:
        json_decode = json.loads(fp.read())

    line = str(json_decode)

    match = re.findall(r'[\w\.-]+@[\w.-]+', line)
    print(match)

    relevant_emails = []
    spam_emails = []

    for email in match:
        domain = email.split('@')[1]

        if domain in relevant_domains:
            relevant_emails.append(email)
        else:
            spam_emails.append(email)
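
To try this approach without a `test.json` file on disk, here is a self-contained sketch that wraps the same logic in a function and runs it on an in-memory JSON string. The `split_emails` helper, the `EMAIL_PATTERN` name, and the shape of the sample JSON are all assumptions made for illustration; only the regex and the addresses come from the question:

```python
import json
import re

# Same pattern as in the question, compiled once.
EMAIL_PATTERN = re.compile(r'[\w.-]+@[\w.-]+')


def split_emails(text, relevant_domains):
    """Return (relevant, other) lists of the addresses found in text."""
    relevant, other = [], []
    for email in EMAIL_PATTERN.findall(text):
        # Take everything after the last '@' and compare case-insensitively.
        domain = email.rsplit('@', 1)[1].lower()
        if domain in relevant_domains:
            relevant.append(email)
        else:
            other.append(email)
    return relevant, other


# Hypothetical JSON payload standing in for the contents of test.json.
sample = json.dumps({
    "from": "taylor.l.ingram@gmail.com",
    "cc": ["arealjcl@countable.us"],
    "attachment": "image001.png@01D36CD8.2A2219D0",
})

data = json.loads(sample)
relevant_emails, spam_emails = split_emails(str(data), {'gmail.com'})

print(relevant_emails)
print(spam_emails)
```

Passing the domains as a set keeps the membership test cheap if the list of relevant domains grows.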