I am trying to use BeautifulSoup to strip the HTML tags from content returned by ExchangeLib. This is what I have so far:
from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup
credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)
for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)
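As written, this uses exchangelib to fetch the most recent email from my inbox via Exchange and prints its unique_body, which contains the body of the email. Here is sample output from print(soup):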
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
My end goal is to have it print:
Hey John,
Here is a test email
From what I have read in the BeautifulSoup documentation, the scraping step goes between my "soup =" line and the final print line.
My problem is that the scraping examples for BeautifulSoup seem to need a class and an h1 tag, for example: name_box = soup.find('h1', attrs={'class': 'name'}), but the HTML I have contains neither.
As someone new to Python, how should I go about this?
Answer 0 (score: 2)
You can use find_all to get all of the font tags and then iterate over them, printing the text of each one.
from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")
for span in soup.find_all('font'):
    print(span.text)
Output:
Hey John,
Here is a test email
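Note that the middle font tag holds only a space, so the loop above also prints a blank-looking line between the two sentences. The following is a minimal sketch, assuming the same sample HTML as in the question, of one way to skip whitespace-only text: BeautifulSoup's stripped_strings generator, which needs no tag name or class at all. The variable names html and lines are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample HTML copied from the question; in the real script this would
# presumably be item.unique_body instead of a literal string.
html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields every text node with surrounding whitespace removed
# and skips nodes that are empty after stripping.
lines = list(soup.stripped_strings)
print("\n".join(lines))
```

Because this walks every text node in the document, it would also pick up text outside the font tags if the email contained any; for the sample above the two are equivalent.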
Answer 1 (score: 0)
You need to print the contents of the font tags. You can use the select method and pass it a type selector for the font element.
from bs4 import BeautifulSoup as bs
html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''
soup = bs(html, 'lxml')
textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
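The comparison item.text != ' ' only matches a single ordinary space, so a non-breaking space or a run of several spaces would slip through. As a hedged variant, the sketch below filters on strip() instead and joins the result so it prints one sentence per line, as the question asks. The helper name html_to_lines and the shortened sample string are assumptions made for illustration, and the built-in html.parser is used so that lxml does not need to be installed.

```python
from bs4 import BeautifulSoup


def html_to_lines(html):
    """Return the non-blank text of every <font> element (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # strip() removes ordinary spaces as well as non-breaking spaces (\xa0).
    texts = (font.get_text().strip() for font in soup.select("font"))
    # Keep only entries that still contain text after stripping.
    return [text for text in texts if text]


# Shortened stand-in for the email body; the middle tag holds a non-breaking space.
sample = (
    "<div><font><span>Hey John,</span></font></div>"
    "<div><font><span>\xa0</span></font></div>"
    "<div><font><span>Here is a test email</span></font></div>"
)

print("\n".join(html_to_lines(sample)))
```

In the loop from the question, this might be called as html_to_lines(item.unique_body), assuming unique_body is an HTML string like the sample shown there.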