Question

I'm trying to use BeautifulSoup to strip the HTML tags from content returned by exchangelib. Here is what I have so far:

from exchangelib import Credentials, Account
import urllib3
from bs4 import BeautifulSoup

credentials = Credentials('myemail@notreal.com', 'topSecret')
account = Account('myemail@notreal.com', credentials=credentials, autodiscover=True)

for item in account.inbox.all().order_by('-datetime_received')[:1]:
    soup = BeautifulSoup(item.unique_body, 'html.parser')
    print(soup)

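As is, this uses exchangelib to pull the first (most recently received) email from my inbox through Exchange and prints its unique_body, which contains the body of the email. Here is a sample of the output of print(soup):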

<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>

My end goal is to have it print:

Hey John,
Here is a test email

From what I've read in the BeautifulSoup documentation, the scraping step goes between my "soup =" line and the final print line.

My problem is that the scraping examples for BeautifulSoup all seem to need a class and an h1 tag, for example name_box = soup.find('h1', attrs={'class': 'name'}), but the HTML I have contains nothing like that.

As someone new to Python, how should I go about this?

Answer 1

You can use find_all to get all the font tags, then iterate over them and print the text of each.

from bs4 import BeautifulSoup
html="""<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
for font in soup.find_all('font'):
    print(font.text)

Output:

Hey John,

Here is a test email
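
The blank line in the output above comes from the middle font tag, which holds only whitespace. If the goal is exactly the two lines from the question, one option is to skip entries that are empty after stripping. This is a minimal sketch, assuming the same sample HTML; the helper name font_lines is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""


def font_lines(markup):
    """Return the text of each <font> tag, skipping whitespace-only ones.

    font_lines is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for font in soup.find_all("font"):
        # get_text(strip=True) trims each string and drops empty ones
        text = font.get_text(strip=True)
        if text:
            lines.append(text)
    return lines


for line in font_lines(html):
    print(line)
```

In the question's loop, the same helper could be called as font_lines(item.unique_body).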

Answer 2

You need to print the contents of the font tags. You can use the select method and pass it a type selector for the font element.

from bs4 import BeautifulSoup as bs

html = '''
<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>
'''

soup = bs(html, 'lxml')

textStuff = [item.text for item in soup.select('font') if item.text != ' ']
print(textStuff)
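
Both answers rely on the text sitting inside font tags, which other emails may not use. As an alternative sketch that does not depend on any particular tag, BeautifulSoup's stripped_strings generator yields every text node with surrounding whitespace removed and skips the ones that end up empty. The function name body_lines is hypothetical, and html.parser is used here only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup

html = """<html><body><div>
<div><span lang="en-US">
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Hey John,</span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;"> </span></font></div>
<div style="margin:0;"><font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Here is a test email</span></font></div>
</span></div>
</div>
</body></html>"""


def body_lines(markup):
    """Return all non-empty text fragments in the markup, in document order.

    body_lines is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # stripped_strings strips each text node and skips whitespace-only nodes
    return list(soup.stripped_strings)


print("\n".join(body_lines(html)))
```

Note that this returns one entry per text node, so a sentence broken up by inline tags such as b or a would come back as several fragments.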

Return the body text using BeautifulSoup
