Email string:
can i buy a laptop<br><br>-- <br>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<p style="color:rgb(0,0,0);font-family:times;font-size:medium">
Some important Text/ Email Signature
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div><br>
Desired output:
{
body: "can i buy a laptop",
Signature: "Some important Text/ Email Signature"
}
Another problem is that the email text is dynamic. It can also look like this:
<div dir="ltr">Can i buy a phone?<br clear="all">
<div><br>-- <br>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr"><span>
<div dir="ltr"><span style="color:rgb(136,136,136)"></span>
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div> Some Important Divs</div>
</div>
</div>
</div>
</div>
</div>
</span></div>
</div>
</div>
</div>
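So the body can't really be identified by the 'ltr' attribute alone. So far, I have been using the first dir="ltr" div to extract the first part, and gmail_signature for the signature.
import re
from bs4 import BeautifulSoup, NavigableString

allText = ""
replies = ""

# emailText holds the raw HTML of the email
soup = BeautifulSoup(emailText, 'html.parser')
mainbody = soup.find('div', {'dir': 'ltr'})
if mainbody is not None:
    # Keep only the text nodes that are direct children of the first ltr div
    texts = [t for t in mainbody.contents if isinstance(t, NavigableString)]
    print('Mainbody: ', mainbody)
    print('Texts: ', texts)
    if len(texts) != 0:
        for idx, txt in enumerate(texts):
            allText += txt
            # Add a newline between text nodes, but not after the last one
            if idx != len(texts) - 1:
                allText += "\n"
quotes = soup.find('div', {'class': 'gmail_quote'})
if quotes is not None:
    for div in quotes:
        replies += " " + div.text
    # replies = replies.replace("\n", "")
    replies = replies.replace("\r", "")
    replies = re.sub(' +', ' ', replies)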
Answer 0 (score: 1)
Try this. For the second example:
from bs4 import BeautifulSoup

data = dict()
html_page = """<div dir="ltr">Can i buy a phone?<br clear="all">
<div><br>-- <br>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr"><span>
<div dir="ltr"><span style="color:rgb(136,136,136)"></span>
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div> Some Important Divs</div>
</div>
</div>
</div>
</div>
</div>
</span></div>
</div>
</div>
</div>"""
soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node in document order
text = soup.find_all(string=True)
output = ''
blacklist = [
    #'[document]',
    #'noscript',
    #'header',
    'html',
    #'meta',
    #'head',
    #'input',
    #'script',
    # there may be more elements you don't want, such as "style", etc.
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
# Split on the "--" signature delimiter; otherwise fall back to a "Best Regards " sign-off
if "--" in output:
    res = output.replace("\n", "").split("--")
else:
    res = output.replace("\n", "").split("Best Regards ")
data["body"] = res[0]
data["signature"] = res[1].strip()
print(data)
Output:
{'body': 'Can i buy a phone? ', 'signature': 'Some Important Divs'}
For the first example:
from bs4 import BeautifulSoup

data = dict()
html_page = """can i buy a laptop<br><br>-- <br>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<p style="color:rgb(0,0,0);font-family:times;font-size:medium">
Some important Text/ Email Signature
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div><br>"""
soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node in document order
text = soup.find_all(string=True)
output = ''
blacklist = [
    #'[document]',
    #'noscript',
    #'header',
    'html',
    #'meta',
    #'head',
    #'input',
    #'script',
    # there may be more elements you don't want, such as "style", etc.
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
# Split on the "--" signature delimiter; otherwise fall back to a "Best Regards " sign-off
if "--" in output:
    res = output.replace("\n", "").split("--")
else:
    res = output.replace("\n", "").split("Best Regards ")
data["body"] = res[0]
data["signature"] = res[1].strip()
print(data)
Output:
{'body': 'can i buy a laptop ', 'signature': 'Some important Text/ Email Signature'}
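Since the two snippets differ only in their input, the same idea can be wrapped in one function. The following is a minimal sketch, not part of the original answer: the name split_email is hypothetical, and it assumes the signature begins after a text node that is exactly "--" once whitespace is stripped (the usual "-- " signature delimiter). It compares whole text nodes instead of searching the joined string, so a "--" that appears inside a sentence in the body does not trigger a split.

```python
from bs4 import BeautifulSoup


def split_email(html):
    """Split an HTML email into body and signature (hypothetical helper).

    Assumption: the signature starts after a text node that equals '--'
    once surrounding whitespace is removed.
    """
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields each non-empty text node with whitespace trimmed
    parts = list(soup.stripped_strings)
    if "--" in parts:
        idx = parts.index("--")
        body, signature = parts[:idx], parts[idx + 1:]
    else:
        # No delimiter found: treat everything as body
        body, signature = parts, []
    return {"body": " ".join(body), "signature": " ".join(signature)}


if __name__ == "__main__":
    sample = (
        '<div dir="ltr">Can i buy a phone?<br><div><br>-- <br>'
        '<div class="gmail_signature"><div> Some Important Divs</div></div>'
        '</div></div>'
    )
    print(split_email(sample))
    # {'body': 'Can i buy a phone?', 'signature': 'Some Important Divs'}
```

When there is no delimiter, the sketch returns an empty signature instead of raising an IndexError. It does not handle quoted replies (gmail_quote) or signatures without a "--" line.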