I have a data dump from which I am trying to extract all the email addresses.
Here is the code I wrote using BeautifulSoup:
import urllib2
import re
from bs4 import BeautifulSoup
url = urllib2.urlopen("file:///users/home/Desktop/emails.html").read()
soup = BeautifulSoup(url)
email = raw_input(soup)
match = re.findall(r'<(.*?)>', email)
if match:
    print match
Sample data dump:
<tr><td><a href="http://abc.gov.com/comments/24-April/file.html">for educational purposes only</a></td>
<td>7418681641 <sampleemail@gmail.com></td>
<td>advqos@abc.gov.com</td>
<td nowrap="">24-04-2015 10.31</td>
<td align="center"> </td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td>
<td>John <someemail@gmail.com></td>
<td>advqos@abc.gov.com</td>
<td nowrap="">24-04-2015 11.28</td>
<td align="center"> </td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">something</a></td>
<td>Mark <123random@gmail.com></td>
<td>test@abc.gov.com</td>
<td nowrap="">24-04-2015 11.28</td>
<td align="center"> </td></tr>
<tr><td><a href="http://abc.gov.com/comments/24-April/abc.html">some data</a></td>
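I can clearly see that the emails are listed between the < and > markers. I am trying to use a regular expression to identify all the emails and print them. However, when I run the code, instead of extracting only the emails (one email per line), it prints the whole file.
How can I fix this?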
Answer 0 (score: 1)
Your example actually works.
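To illustrate the point, here is a minimal sketch (written for Python 3, not the answerer's original code) that applies the question's pattern to one cell of the sample data held as a plain string. The variable name cell_text is made up for this example, and it assumes the angle brackets are literal characters in the text being searched. If that holds, the pattern itself is fine, and the whole-file output most likely comes from raw_input(soup), which writes its argument to the screen as a prompt and then waits for keyboard input.

```python
import re

# One cell from the sample dump, held as a plain string (assumption: the
# angle brackets are literal characters in the text being searched).
cell_text = "John <someemail@gmail.com>"

# Same pattern as in the question: capture whatever sits between < and >.
matches = re.findall(r'<(.*?)>', cell_text)
print(matches)
```

Passing the file contents straight to re.findall, instead of routing them through raw_input, should therefore avoid the prompt that dumps the whole file.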
Answer 1 (score: 0)
import re

# Make sure the text file is in the same folder as the python file.
with open('text.txt', 'r') as f:
    matches = re.findall(r'<(.+?)>', f.read())

print('\n'.join(matches))
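One caveat: if the file holds the raw HTML rather than plain text, a pattern like <(.+?)> also matches ordinary tags such as <td>. The sketch below (Python 3) is one possible way around that, requiring an @ between the brackets. The name EMAIL_IN_BRACKETS and the pattern are assumptions made for this illustration, not a full email validator, and a few lines of the sample dump are inlined so the example is self-contained.

```python
import re

# A few lines from the sample dump, inlined so the sketch is self-contained.
sample = '''<tr><td><a href="http://abc.gov.com/comments/24-April/test.html">no_subject</a></td>
<td>John <someemail@gmail.com></td>
<td>advqos@abc.gov.com</td>
<td>Mark <123random@gmail.com></td>'''

# Require an "@" between the brackets so that HTML tags are skipped.
# Whitespace and nested brackets are excluded from both sides of the "@".
EMAIL_IN_BRACKETS = re.compile(r'<([^<>\s@]+@[^<>\s@]+)>')

emails = EMAIL_IN_BRACKETS.findall(sample)
print('\n'.join(emails))
```

Addresses that are not wrapped in angle brackets, such as advqos@abc.gov.com in the third cell, are deliberately left out, since the question only asks for the bracketed ones.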
Answer 2 (score: -1)
You can use BeautifulSoup's find_all method to parse out the tags you are looking for. Here is the code. (I have stored the sample file as a.html.)
from bs4 import BeautifulSoup

url = open("a.html", 'r').read()
soup = BeautifulSoup(url)
rows = soup.find_all('tr')  # find all rows using the 'tr' tag
for row in rows:
    cols = row.find_all('td')  # find all columns using the 'td' tag
    if len(cols) > 1:
        # text of the second cell, which contains the email id
        email_id_string = cols[1].text
        # keep only the email id between < and >
        email_id = email_id_string[email_id_string.find("<") + 1 : email_id_string.find(">")]
        print(email_id)
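Note that str.find returns -1 when the character is missing, so for a cell without angle brackets the slice above quietly becomes text[0:-1] and returns the cell text minus its last character. A small guarded helper, sketched here for Python 3 under the assumption that the cell text is already available as a plain string, avoids that. The name extract_between is hypothetical and not part of BeautifulSoup.

```python
def extract_between(text, start="<", end=">"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition() returns an empty separator when the marker is not found.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return value.strip()


with_brackets = extract_between("John <someemail@gmail.com>")
without_brackets = extract_between("advqos@abc.gov.com")
print(with_brackets, without_brackets)
```

In the loop above, the slicing line could then call extract_between(cols[1].text) and skip the row when the result is None.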