Question

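I am using BeautifulSoup to extract plain text from HTML emails. It works well except for one problem. My emails usually contain the earlier replies below the top message, so when I process a thread I end up capturing the same text repeatedly. In most cases I just want to get rid of everything from the first <div> tag onward. If I print the tag names in soup.contents, I get the following: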

p None p None p None p None p None p None p None p None p None p None p None div None meta None style None div None p

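I would like to get back a BeautifulSoup object with the first div tag and everything after it removed.

HTML-wise, this is the before and after I am going for:

Before: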

<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>

<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe &lt;jdoe@example.com&gt;</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison &lt;allison@example.com&gt;</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
                 </p>
</div>

<p>Will you be at the meeting tonight?</p>

After:

<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>

Answer 1

In this case, the simplest approach is to use the re module and remove everything from the first <div> tag onward:

s = """<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>

<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe &lt;jdoe@example.com&gt;</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison &lt;allison@example.com&gt;</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
                 </p>
</div>

<p>Will you be at the meeting tonight?</p>"""

import re

new_s = re.sub(r'<div.*', '', s, flags=re.DOTALL).strip()
print(new_s)

Prints:

<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
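
One caveat with the pattern above: `<div.*` is case-sensitive and also matches any tag whose name merely starts with "div" (for example a hypothetical `<divider>` element). The following is a slightly stricter sketch under those assumptions, not a general HTML parser; the helper name `strip_from_first_div` is made up for illustration.

```python
import re

# \b requires the tag name to end right after "div", so "<divider>" is left alone.
# re.IGNORECASE also catches "<DIV ...>", which some mail clients emit.
DIV_AND_REST = re.compile(r"<div\b.*", flags=re.DOTALL | re.IGNORECASE)

def strip_from_first_div(html):
    """Drop the first <div> tag and everything after it (hypothetical helper)."""
    return DIV_AND_REST.sub("", html, count=1).strip()

sample = "<p>Hi</p>\n<DIV class='q'>old</DIV>\n<p>older</p>"
print(strip_from_first_div(sample))
```

For the email in the question, this returns the same three paragraphs as `new_s` above.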

You can then feed this new string to BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(new_s, 'lxml')

print(soup.prettify())

Output:

<html>
 <body>
  <p>
   Hi Joe
  </p>
  <p>
   I will be at the meeting tonight
  </p>
  <p>
   Allison
  </p>
 </body>
</html>
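
If you would rather stay inside BeautifulSoup instead of pre-processing the string, a possible alternative is to find the first div and detach it together with every sibling that follows it. This is a minimal sketch, assuming the quoted reply starts at a div that sits at the same nesting level as the message paragraphs (as in the question's example); the function name `truncate_at_first_div` is hypothetical, and it uses the built-in `html.parser` rather than `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

def truncate_at_first_div(html):
    """Return a soup with the first <div> and all of its following siblings removed."""
    soup = BeautifulSoup(html, "html.parser")
    first_div = soup.find("div")
    if first_div is not None:
        # Copy the siblings into a list first: extract() mutates the tree
        # while next_siblings is still being iterated.
        for node in list(first_div.next_siblings):
            node.extract()
        first_div.extract()
    return soup

html = (
    "<p> Hi Joe </p>\n"
    "<p> I will be at the meeting tonight</p>\n"
    "<p> Allison </p>\n"
    "<div><p><b>From: </b>John Doe</p></div>\n"
    "<p>Will you be at the meeting tonight?</p>"
)
soup = truncate_at_first_div(html)
print(soup.get_text().strip())
```

Note that only siblings of the first div are removed, so if that div is nested inside another element, content following the outer element would remain.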
