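I have a list of email addresses. Some come from a relevant domain, while others come from spam or irrelevant domains. I want to capture both kinds, but in separate lists. I know where the relevant mail comes from (always the same domain, @gmail.com), but the spam comes from many different domains, and all of it needs to be captured.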
# Extract all email ids from a JSON file
import re
import json
with open("test.json", 'r') as fp:
    json_decode = json.loads(fp.read())
    line = str(json_decode)
    match = re.findall(r'[\w\.-]+@[\w.-]+', line)
    l = len(match)
    print(match)
    for i in match:
        domain = match.split('@')[i]
OUTPUT: match = ['image001.png@01D36CD8.2A2219D0', 'arealjcl@countable.us', 'taylor.l.ingram@gmail.com']
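The first two are spam and the third is a legitimate email, so they must end up in different lists. Should I split at @ to determine the domain, or should I exclude everything that is not @gmail.com and dump it into another list?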
Answer 0 (score: 1)
I suggest using the endswith() string method. Here is how to use it:
legit = []
spam = []
# We iterate through the list of matches
for email in match:
    # This checks if the email ends with @gmail.com.
    # If it returns True, that means it is a good email.
    # But, if it returns False, then it means that the email
    # is spam.
    email_status = email.endswith("@gmail.com")
    if not email_status:
        spam.append(email)
    else:
        legit.append(email)
Edit: changed the code so that it correctly answers your question.
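If more than one domain ever counts as relevant, note that str.endswith() also accepts a tuple of suffixes. The following is only a sketch under that assumption; the function name split_by_suffix and the lowercase comparison are illustrative additions, not part of the original answer.

```python
def split_by_suffix(emails, suffixes=("@gmail.com",)):
    """Partition emails into (legit, spam) by their domain suffix."""
    legit, spam = [], []
    for email in emails:
        # Domain names are case-insensitive, so compare in lowercase.
        # endswith() accepts a tuple and returns True if any suffix matches.
        if email.lower().endswith(tuple(suffixes)):
            legit.append(email)
        else:
            spam.append(email)
    return legit, spam


match = ['image001.png@01D36CD8.2A2219D0', 'arealjcl@countable.us',
         'taylor.l.ingram@gmail.com']
legit, spam = split_by_suffix(match)
print(legit)
print(spam)
```

Including the leading "@" in each suffix matters: without it, an address at a domain such as notgmail.com would also pass the check.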
Answer 1 (score: 0)
When you split an email address on '@', you get a list of two items:
In [3]: 'image001.png@01D36CD8.2A2219D0'.split('@')
Out[3]: ['image001.png', '01D36CD8.2A2219D0']
To check the domain, index the second item of the result:
In [4]: q = 'image001.png@01D36CD8.2A2219D0'.split('@')
In [5]: q[1]
Out[5]: '01D36CD8.2A2219D0'
So your for loop should look more like this:
In [9]: for thing in match:
   ...:     domain = thing.split('@')[1]
   ...:     print(domain)
   ...:
01D36CD8.2A2219D0
countable.us
gmail.com
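Indexing with [1] raises IndexError when a string contains no '@' at all. A slightly more defensive variant, offered only as a sketch, uses str.rpartition(); the helper name domain_of and the choice to return None for malformed input are assumptions made for illustration.

```python
def domain_of(address):
    """Return the lowercased domain of an address, or None if it has no '@'."""
    # rpartition splits on the LAST '@'; sep is '' when no '@' is present.
    local, sep, domain = address.rpartition('@')
    return domain.lower() if sep else None


for thing in ['arealjcl@countable.us', 'taylor.l.ingram@gmail.com', 'no-at-sign']:
    print(thing, '->', domain_of(thing))
```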
Answer 2 (score: 0)
You can split them into two lists according to a defined list of relevant domains:
# extract all email ids from a json file
import re
import json

# Bare domains, without the leading '@', because split('@')[1]
# returns only the part after the '@'. You can add more.
relevant_domains = ['gmail.com']

with open("test.json", 'r') as fp:
    json_decode = json.load(fp)
line = str(json_decode)
match = re.findall(r'[\w.-]+@[\w.-]+', line)
print(match)

relevant_emails = []
spam_emails = []
for email in match:
    domain = email.split('@')[1]
    if domain in relevant_domains:
        relevant_emails.append(email)
    else:
        spam_emails.append(email)
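To try the whole flow without a test.json file on disk, here is a self-contained sketch. The sample JSON string is made up for illustration and only reuses the three addresses from the question; the set of bare domain names is likewise an assumption about how you would store the relevant domains.

```python
import json
import re

# Hypothetical stand-in for the contents of test.json.
sample = ('{"from": "taylor.l.ingram@gmail.com", '
          '"body": "cid:image001.png@01D36CD8.2A2219D0 sent by arealjcl@countable.us"}')

# Bare domain names (no leading '@'); a set gives fast membership tests.
relevant_domains = {"gmail.com"}

text = str(json.loads(sample))
found = re.findall(r'[\w.-]+@[\w.-]+', text)

relevant_emails = []
spam_emails = []
for email in found:
    # rsplit with maxsplit=1 keeps everything after the last '@' as the domain.
    domain = email.rsplit('@', 1)[1].lower()
    if domain in relevant_domains:
        relevant_emails.append(email)
    else:
        spam_emails.append(email)

print(relevant_emails)
print(spam_emails)
```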