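I'm using BeautifulSoup to extract plain text from HTML emails. It works well except for one problem. My emails usually include the earlier replies below the top message, so when I process an email thread I end up capturing the same text more than once. In most cases, I just want to get rid of everything after the first <div> tag that is found. If I print the name of each item in soup.contents, I get the following: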
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
p
None
div
None
meta
None
style
None
div
None
p
I want to get back a BeautifulSoup object with everything from the first div tag onward removed.
In HTML terms, this is the before and after I'm going for:
Before:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe <jdoe@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison <allison@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
</p>
</div>
<p>Will you be at the meeting tonight?</p>
After:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
Answer 0 (score: 1)
In this case, the simplest approach is to use re and remove everything from the first <div> tag onward:
s = """<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
<div style='border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(181, 196, 223) currentColor currentColor; font-family: "Arial","sans-serif";'>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>From: </b>John Doe <jdoe@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Sent: </b>Wednesday, May 30, 2018 6:48 AM</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>To: </b>Allison <allison@example.com></p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
<b>Subject: </b>RE: meeting tonight</p>
<p style="margin: 2px 0px; padding: 0px; color: rgb(34, 34, 34); font-family: Arial; font-size: 10pt; background-color: rgb(255, 255, 255);">
</p>
</div>
<p>Will you be at the meeting tonight?</p>"""
import re
new_s = re.sub(r'<div.*', '', s, flags=re.DOTALL).strip()
print(new_s)
Prints:
<p> Hi Joe </p>
<p> I will be at the meeting tonight</p>
<p> Allison </p>
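One caveat with the pattern above: `<div.*` is case-sensitive and would also match a tag whose name merely starts with "div". A slightly stricter variant, offered as a minimal sketch rather than a complete solution, is shown below. The helper name `strip_from_first_div` is hypothetical and only used for illustration.

```python
import re

# \b keeps the match from firing on tag names that merely start with "div";
# IGNORECASE also catches <DIV ...>; DOTALL lets .* run across newlines.
_FIRST_DIV = re.compile(r"<div\b.*", flags=re.DOTALL | re.IGNORECASE)


def strip_from_first_div(html):
    """Return html with the first <div> tag and everything after it removed."""
    return _FIRST_DIV.sub("", html, count=1).strip()


sample = "<p>a</p>\n<DIV class='x'><p>b</p></DIV><p>c</p>"
print(strip_from_first_div(sample))
```

A string that contains no <div> at all comes back unchanged apart from the surrounding whitespace being stripped.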
You can then feed this new string to BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(new_s, 'lxml')
print(soup.prettify())
Output:
<html>
<body>
<p>
Hi Joe
</p>
<p>
I will be at the meeting tonight
</p>
<p>
Allison
</p>
</body>
</html>
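If you would rather not touch the raw string, the same trimming can probably be done on the parsed tree instead. The following is a minimal sketch, assuming the first <div> sits at the same level as the paragraphs you want to keep (as in the sample email). The function name `drop_from_first_div` is hypothetical, and `html.parser` is used here instead of `lxml` only so that no extra parser is required.

```python
from bs4 import BeautifulSoup


def drop_from_first_div(html):
    """Parse html and remove the first <div> plus every sibling after it."""
    soup = BeautifulSoup(html, "html.parser")
    first_div = soup.find("div")
    if first_div is not None:
        # Copy the siblings into a list first, because extracting nodes
        # while iterating the live generator would cut the walk short.
        for node in list(first_div.next_siblings):
            node.extract()
        first_div.decompose()
    return soup


sample = (
    "<p> Hi Joe </p>\n"
    "<p> Allison </p>\n"
    "<div><p><b>From: </b>John Doe</p></div>\n"
    "<p>Will you be at the meeting tonight?</p>"
)
trimmed = drop_from_first_div(sample)
print(trimmed.get_text().strip())
```

Note that this only removes siblings of the first <div>. If the <div> is nested inside another element, content that follows that parent element would remain, so the regex approach may still be the simpler choice for such messages.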