Question

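I want to extract the content of an email. The content is HTML, and I use BeautifulSoup to fetch the From, To and Subject. When I fetch the body, it returns only the first line and leaves out the remaining lines and paragraphs.

I am missing something here. How do I read all the lines/paragraphs?

CODE: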


email_message = mail.getEmail(unreadId)
print (email_message['From'])
print (email_message['Subject'])

if email_message.is_multipart():
    for payload in email_message.get_payload():
        bodytext = email_message.get_payload()[0].get_payload()
        if type(bodytext) is list:
            bodytext = ','.join(str(v) for v in bodytext)
else:
    bodytext = email_message.get_payload()[0].get_payload()
    if type(bodytext) is list:
        bodytext = ','.join(str(v) for v in bodytext)
print (bodytext)
parsedContent = BeautifulSoup(bodytext)
body = parsedContent.findAll('p').getText()
print body

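When I use

body = parsedContent.findAll('p').getText()

I get:

AttributeError: 'list' object has no attribute 'getText'

When I use

body = parsedContent.find('p').getText()

it fetches the first line of the content and does not print the remaining lines.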

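Added

After getting all the lines from the HTML tags, I get an = symbol at the end of each line, and &nbsp; and &lt; are also displayed. How do I overcome these?

Extracted text

Dear Sir, All of us at GenWatt are glad to have xyz as a customer. I would like to introduce myself as your Account Manager. Should you have any questions, please feel free to call me or email me at ash=wis@xyz.com. You can also contact GenWatt on the following numbers: Main: 810-543-1100 Sales: 810-545-1222 Customer Service &amp; Support: 810-542-1233 Fax: 810-545-1001 I am confident GenWatt will serve you well and hope to see our relationship=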

Answer 1

Let's check the result of parsedContent.findAll('p'):

python -i test.py
----------
import requests
from bs4 import BeautifulSoup

bodytext = requests.get("https://en.wikipedia.org/wiki/Earth").text
parsedContent = BeautifulSoup(bodytext, 'html.parser')

# findAll returns every matching <p> element, not a single element
paragraphs = parsedContent.findAll('p')
----------

>> type(paragraphs)
<class 'bs4.element.ResultSet'> 
>> issubclass(type(paragraphs), list) 
True # It's a list

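See? It is a list of all the paragraphs. To access their content, you need to iterate over the list or access an element by index, as with any ordinary list.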

>> # You can print all content with a for-loop
>> for p in paragraphs:
>>     print(p.getText())
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
...    

>> # Or you can join all content
>> content = []
>> for p in paragraphs:
>>     content.append(p.getText())
>> 
>> all_content = "\n".join(content)
>>
>> print(all_content)
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)

Using a list comprehension, your code would look like this:

parsedContent = BeautifulSoup(bodytext, 'html.parser')
body = '\n'.join([p.getText() for p in parsedContent.findAll('p')])
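
For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same idea. The HTML string is a hypothetical stand-in for the decoded email body, not the asker's actual message. It uses find_all and get_text, which are the current bs4 spellings of findAll and getText.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the decoded email body.
bodytext = """
<html><body>
<p>Dear Sir,</p>
<p>All of us at GenWatt are glad to have xyz as a customer.</p>
<p>Main: 810-543-1100</p>
</body></html>
"""

parsedContent = BeautifulSoup(bodytext, "html.parser")

# find_all returns every <p>; get_text is called on each element,
# never on the list itself (that is what raised the AttributeError).
paragraphs = parsedContent.find_all("p")
body = "\n".join(p.get_text() for p in paragraphs)

print(body)
```

Each paragraph ends up on its own line of body, whereas find("p") alone would give only the first one.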

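Regarding "When I use

body = parsedContent.find('p').getText()

it fetches the first line of the content and does not print the remaining lines":

parsedContent.find('p') is exactly the same as parsedContent.findAll('p')[0] whenever at least one <p> exists: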

>> parsedContent.findAll('p')[0].getText() == parsedContent.find('p').getText()
True
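
As for the follow-up in the question (an = at the end of each line): that symptom is consistent with a quoted-printable body being read without decoding, although the question does not confirm how the message is encoded. Below is a minimal sketch under that assumption, using only the standard library. The raw message, the address, and the helper name html_body are hypothetical.

```python
import email

# Hypothetical raw message: a single HTML part sent as quoted-printable.
# The "=" at the end of the first body line is a soft line break.
RAW = (
    "From: sender@example.com\n"
    "Subject: Welcome\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Customer Service &amp; Support: we hope to see our relation=\n"
    "ship grow.</p>\n"
)


def html_body(message):
    """Return the decoded text/html part of a message, or '' if there is none."""
    # walk() visits the message itself and, for multipart mail, every sub-part.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the Content-Transfer-Encoding and returns bytes.
            raw_bytes = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return raw_bytes.decode(charset, errors="replace")
    return ""


message = email.message_from_string(RAW)
html = html_body(message)
print(html)
```

Parsing the decoded string with BeautifulSoup and calling getText() on each <p> should then return entities such as &amp; and &nbsp; as ordinary characters (&nbsp; becomes a non-breaking space).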
