Returning the body text with BeautifulSoup

Time: 2019-03-01 18:21:01

Tags: python email web-scraping beautifulsoup

I'm trying to use BeautifulSoup to strip the HTML tags from content returned by exchangelib. This is what I have so far:

from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup

credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)

for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)

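As written, this uses exchangelib to pull the first (most recently received) email from my inbox via Exchange and prints its unique_body, which contains the email body. Here is a sample of the output from print(soup):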

<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>

My end goal is to have it print:

Hey John,
Here is a test email

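From what I've read in the BeautifulSoup documentation, the scraping step belongs between my "soup =" line and the final print line.

My problem is that the scraping examples all seem to need a class and an h1 tag, for example: name_box = soup.find('h1', attrs={'class': 'name'}), but my HTML has neither.

As someone new to Python, how should I go about this?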

2 Answers:

Answer 0 (score: 2)

You can use find_all to get every font tag and then iterate over the results, printing the text of each one.

from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for font in soup.find_all('font'):
    print(font.text)

Output:

Hey John,

Here is a test email
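
The loop above still prints the whitespace-only line from the middle font tag. If that line is unwanted, one option is BeautifulSoup's stripped_strings generator, which yields each text node with surrounding whitespace removed and skips nodes that are whitespace only. This is a minimal sketch against the sample HTML from the question, and it assumes all of the text in the body is wanted, not just the text inside font tags:

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings walks every text node in the document, strips it,
# and drops the ones that are empty after stripping.
lines = list(soup.stripped_strings)
print("\n".join(lines))
```

For this sample it prints the two lines from the question's desired output with no blank line between them.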

Answer 1 (score: 0)

You need to print the contents of the font tags. You can use the select method and pass it a type selector for the font element.

from bs4 import BeautifulSoup as bs

html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''

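# The 'lxml' parser requires the lxml package; 'html.parser' also works here.
soup = bs(html, 'lxml')

# Keep only the font tags whose text is not empty or whitespace-only.
textStuff = [item.text for item in soup.select('font') if item.text.strip()]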
print(textStuff)
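
On the question's worry about needing a class and an h1 tag: find, find_all and select all accept a bare tag name, and the attrs filter is optional. As a rough sketch of how this could be wired back into the exchangelib loop, the helper below (body_to_text is a hypothetical name, not part of either library) takes an HTML string and returns the non-empty font text joined by newlines. It assumes every line of the body sits in its own font tag, as in the sample; nested font tags would produce duplicated text.

```python
from bs4 import BeautifulSoup


def body_to_text(html):
    """Return the non-empty text of each <font> tag, one per line.

    Hypothetical helper for illustration; assumes one <font> tag per line.
    """
    soup = BeautifulSoup(html, "html.parser")
    # A bare tag name is enough for find_all; no class or attrs needed.
    texts = (font.get_text(strip=True) for font in soup.find_all("font"))
    # Skip tags whose text is empty after stripping.
    return "\n".join(text for text in texts if text)


# Shortened stand-in for the email body shown in the question.
sample = (
    '<div><font size="2"><span>Hey John,</span></font></div>'
    '<div><font size="2"><span> </span></font></div>'
    '<div><font size="2"><span>Here is a test email</span></font></div>'
)
print(body_to_text(sample))
```

Inside the question's for loop, the same helper could be called as print(body_to_text(item.unique_body)) in place of building and printing the soup directly.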