Extracting emails from a data dump using Python

Time: 2016-06-07 15:56:24

Tags: python parsing web-scraping beautifulsoup

I have a data dump from which I am trying to extract all the emails.

Here is the code I wrote using BeautifulSoup:

import urllib2
import re
from bs4 import BeautifulSoup
url = urllib2.urlopen("file:///users/home/Desktop/emails.html").read()
soup = BeautifulSoup(url)
email = raw_input(soup)
match = re.findall(r'<(.*?)>', email)
if match:
    print match

Sample data dump:

<tr><td><a href="http://abc.gov.com/comments/24-April/file.html">for educational purposes only</a></td>
<td>7418681641 &lt;sampleemail@gmail.com&gt;</td>
<td>advqos@abc.gov.com</td>
<td nowrap="">24-04-2015 10.31</td>
<td align="center">&nbsp;</td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td>
<td>John &lt;someemail@gmail.com&gt;</td>
<td>advqos@abc.gov.com</td>
<td nowrap="">24-04-2015 11.28</td>
<td align="center">&nbsp;</td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">something</a></td>
<td>Mark &lt;123random@gmail.com&gt;</td>
<td>test@abc.gov.com</td>
<td nowrap="">24-04-2015 11.28</td>
<td align="center">&nbsp;</td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/abc.html">some data</a></td>

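I can clearly see that the emails are listed between &lt; and &gt; markers. I am trying to use a regular expression to identify all the emails and print them. However, when I run the code, instead of extracting only the emails (one per line), it prints the entire file.

How can I fix this?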

3 answers:

Answer 0: (score: 1)

Your example actually works.

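A minimal sketch of why the regular expression itself is fine, assuming Python 3 and only the standard library (the question uses Python 2 and BeautifulSoup). The likely culprit in the question's code is raw_input(soup): it prints the whole soup as a prompt and then waits for keyboard input, so the pattern never runs against the file contents. Here the tag stripping with re.sub is a crude stand-in for soup.get_text(), and the two cells are copied from the sample dump.

```python
import re
from html import unescape  # Python 3 only; an assumption, not from the question

# Two cells copied from the sample dump in the question.
raw = (
    '<td>John &lt;someemail@gmail.com&gt;</td>\n'
    '<td>Mark &lt;123random@gmail.com&gt;</td>'
)

# Drop the real HTML tags first, then decode &lt; and &gt; into < and >.
text = unescape(re.sub(r'<[^>]+>', '', raw))

# The pattern from the question, applied to the text instead of keyboard input.
emails = re.findall(r'<(.*?)>', text)
print('\n'.join(emails))
```

With BeautifulSoup, the equivalent change would be to pass soup.get_text() to re.findall instead of calling raw_input.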

Answer 1: (score: 0)

Assuming your data dump is in a text file named text.txt:

import re
# Make sure the text file is in the same folder as the python file.
with open('text.txt','r') as f:
    matches = re.findall(r'&lt;(.+?)&gt;',f.read())
print('\n'.join(matches))
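To try the same pattern without creating a file, here is a small sketch that runs it on a few rows copied from the sample dump. The helper name bracketed_emails is made up for illustration. Note that the pattern only captures addresses wrapped in &lt; and &gt;, so a bare address in its own cell is skipped.

```python
import re

def bracketed_emails(raw_html):
    """Return every string found between the literal entities &lt; and &gt;.

    Hypothetical helper; it wraps the same pattern used in the answer.
    """
    return re.findall(r'&lt;(.+?)&gt;', raw_html)

# Rows copied from the sample dump in the question.
sample_rows = (
    '<td>7418681641 &lt;sampleemail@gmail.com&gt;</td>\n'
    '<td>advqos@abc.gov.com</td>\n'
    '<td>John &lt;someemail@gmail.com&gt;</td>\n'
)

found = bracketed_emails(sample_rows)
print('\n'.join(found))
```

If the plain addresses are wanted as well, a broader email pattern would be needed, which is a different approach from the one in this answer.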

Answer 2: (score: -1)

You can use BeautifulSoup's find_all method to parse the tags you are looking for. Here is the code. (I have saved the sample file as a.html.)

from bs4 import BeautifulSoup
# Read the sample file saved as a.html
with open("a.html", 'r') as f:
    soup = BeautifulSoup(f.read(), "html.parser")
rows = soup.find_all('tr')  # find all rows using tag 'tr'
for row in rows:
    cols = row.find_all('td')  # find all columns using 'td' tag
    if len(cols) > 1:
        # text of the second cell, which holds the sender name and address
        email_id_string = cols[1].text
        # keep only the address between < and >
        email_id = email_id_string[email_id_string.find("<") + 1 : email_id_string.find(">")]
        print(email_id)
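If BeautifulSoup is not installed, the same row-and-column walk can be sketched with the standard library's html.parser instead. This swaps in a different parser from the one used in the answer, assumes Python 3, and uses made-up names (SecondCellCollector, extract_emails). Like the code above, it takes the second cell of each row and keeps the text between < and >.

```python
from html.parser import HTMLParser  # Python 3 standard library

class SecondCellCollector(HTMLParser):
    """Collect the text of the second <td> in every <tr> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &lt; and &gt; arrive decoded.
        super().__init__()
        self.col = -1       # index of the current <td> within its row
        self.in_td = False  # True while inside a <td> element
        self.cells = []     # text of each second cell found so far

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.col = -1
        elif tag == 'td':
            self.col += 1
            self.in_td = True
            if self.col == 1:
                self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.col == 1:
            self.cells[-1] += data

def extract_emails(html_text):
    """Return the text between < and > from the second cell of each row."""
    parser = SecondCellCollector()
    parser.feed(html_text)
    parser.close()
    emails = []
    for cell in parser.cells:
        start, end = cell.find('<'), cell.find('>')
        if start != -1 and end > start:  # skip cells without a bracketed address
            emails.append(cell[start + 1:end])
    return emails

# Two rows copied from the sample dump in the question.
sample = (
    '<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td>\n'
    '<td>John &lt;someemail@gmail.com&gt;</td>\n'
    '<td>advqos@abc.gov.com</td></tr>\n'
    '<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">something</a></td>\n'
    '<td>Mark &lt;123random@gmail.com&gt;</td>\n'
    '<td>test@abc.gov.com</td></tr>\n'
)
print('\n'.join(extract_emails(sample)))
```

The explicit check on start and end avoids printing a stray slice when a second cell has no bracketed address.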