Question

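I'm trying to extract a list of unique email addresses from a .txt file that contains many emails (https://www.py4e.com/code3/mbox.txt). By narrowing the search down to the "From:" and "To:" lines with the following program, I can extract a list of email addresses: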

import re

countFromEmail = 0  # must be initialized before it is incremented
in_file = open('dummy_text_file.txt')
for line in in_file:
    if re.findall(r'^From:.+@([^\.]*)\.', line):
        countFromEmail = countFromEmail + 1
        print(line)
    if re.findall(r'^To:.+@([^\.]*)\.', line):
        print(line)

However, this doesn't give me a unique list, because various email addresses show up repeatedly. Also, what ends up being printed looks like this:

To: java-user@lucene.apache.org

From: Adrien Grand <jpountz@gmail.com>

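I'd like to list only the actual email addresses, without the "To", the "From", or the angle brackets (<>).

I'm not very familiar with Python, but my first idea was to extract the bare email address, store it somewhere, and use a for loop to add each one to a list.

Any help or a pointer in the right direction would be appreciated.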

Answer 1

To get a list of unique emails, take a look at these two posts:

https://www.peterbe.com/plog/uniqifiers-benchmark

How do you remove duplicates from a list whilst preserving order?
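As a rough illustration of the order-preserving approach those posts discuss, here is a minimal sketch. The helper name `unique_in_order` and the sample addresses are made up for the example; it relies on `dict` keys being unique and keeping insertion order (guaranteed since Python 3.7).

```python
def unique_in_order(items):
    """Return the distinct items, keeping the first occurrence of each."""
    # dict keys are unique and remember the order in which they were inserted.
    return list(dict.fromkeys(items))


# Hypothetical sample data, not taken from mbox.txt.
addresses = ["a@example.com", "b@example.com", "a@example.com", "c@example.com"]
print(unique_in_order(addresses))
# ['a@example.com', 'b@example.com', 'c@example.com']
```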

To parse Adrien Grand <jpountz@gmail.com> into another format, the following link should have all the information you need.

https://docs.python.org/3.7/library/email.parser.html#module-email.parser

Unfortunately I don't have time to write an example, but I hope this helps.
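For readers who want a starting point, one possible sketch uses `parseaddr` from the standard library's `email.utils` module (a sibling of the `email.parser` module linked above, not part of it). The helper name `address_from_header` is hypothetical, and the code assumes each line has the simple `Header: value` shape shown in the question.

```python
from email.utils import parseaddr


def address_from_header(line):
    """Return the bare address from a 'From:' or 'To:' header line, or ''."""
    # Drop the header name ("From", "To") and keep only the value after the colon.
    _, _, value = line.partition(":")
    # parseaddr splits 'Name <addr>' into a (name, addr) pair.
    _, address = parseaddr(value.strip())
    return address


print(address_from_header("From: Adrien Grand <jpountz@gmail.com>"))  # jpountz@gmail.com
print(address_from_header("To: java-user@lucene.apache.org"))  # java-user@lucene.apache.org
```

This returns the address without the header name, the display name, or the angle brackets.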

Answer 2

The simplest way is set().

A set contains only unique values.

array = [1, 2, 3, 4, 5, 5, 5]
unique_array = set(array)
print(unique_array)  # {1, 2, 3, 4, 5}
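The same idea applies to strings. Note that a set does not keep the original order, so this small sketch (with made-up sample addresses) sorts the result to get a stable, printable list:

```python
# Hypothetical sample data, not taken from mbox.txt.
emails = ["b@example.com", "a@example.com", "b@example.com"]

# set() drops the duplicate; sorted() turns the set back into an ordered list.
unique_emails = sorted(set(emails))
print(unique_emails)  # ['a@example.com', 'b@example.com']
```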

Answer 3

import re

countFromEmail = 0
unique_emails = set()  # a set keeps only one copy of each value
with open('mbox.txt') as in_file:  # the file is closed automatically
    for line in in_file:
        if re.findall(r'^From:.+@([^\.]*)\.', line):
            countFromEmail += 1
            line = line.replace("From:", "")  # drop the header name
            line = line.strip()  # then trim the surrounding whitespace
            unique_emails.add(line)  # add to the set
        elif re.findall(r'^To:.+@([^\.]*)\.', line):
            line = line.replace("To:", "")  # drop the header name
            line = line.strip()  # then trim the surrounding whitespace
            unique_emails.add(line)  # add to the set
for email in unique_emails:
    print(email)

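There are many different ways to get this result. Using a set is one of them, since the elements of a set are unique (duplicates are discarded on insertion by default).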
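The code above stores whatever follows the header name, so a line such as "From: Adrien Grand <jpountz@gmail.com>" is kept with the display name and angle brackets. If only the bare address is wanted, one possible variation is sketched below. The `ADDRESS` pattern is a deliberately simple assumption, not a full RFC 5322 matcher, and the names `HEADER`, `ADDRESS`, and `unique_addresses` are invented for this example.

```python
import re

# Only lines that start with one of these header names are inspected.
HEADER = re.compile(r"^(?:From|To):")
# Simplified address pattern (an assumption): local part, '@', dotted domain.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def unique_addresses(lines):
    """Collect the distinct addresses found on 'From:' and 'To:' lines."""
    found = set()
    for line in lines:
        if HEADER.match(line):
            # findall returns every address on the line; the set drops repeats.
            found.update(ADDRESS.findall(line))
    return found


sample = [
    "From: Adrien Grand <jpountz@gmail.com>\n",
    "To: java-user@lucene.apache.org\n",
    "From: Adrien Grand <jpountz@gmail.com>\n",
    "Subject: ignored even though it mentions someone@example.com\n",
]
print(sorted(unique_addresses(sample)))
# ['java-user@lucene.apache.org', 'jpountz@gmail.com']
```

Because the function accepts any iterable of lines, an open file object (for example from `open('mbox.txt')`) can be passed in place of `sample`.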

Read more here about the set, Python's unordered collection of unique elements.

I've edited and commented your code for you. Hope this helps. Cheers! :)

-孙俊

How to extract a list of unique email addresses from a file in Python

3 answers: