Question

I am currently scraping a news website for research, using Python + BeautifulSoup as follows:

newsPageSoup = BeautifulSoup(newsPage.content, 'html.parser', from_encoding="iso 639-3")
newsText = newsPageSoup.find(class_='post-content').get_text()

to get the text from the following HTML. This works fine.

<p class="post-content">The completion of the sixth review, upon the granting of a waiver of non‑observance for the end‑June 2019, performance criterion on the primary balance</p>

However, I now want to extract the text "Andrew" from the following:

<p class="text-primary" style="color : #2793ed; font:Arial, Helvetica, sans-serif; font-size:14px; font-weight:normal">Andrew <small style="color:#999999; font-size:11px">Friday, 13 December 2019 07:58 PM </small> </p>

So I used the same Python code as above:

readerNames = newsPageSoup.find(class_='text-primary').get_text()

But it raises the following error:

AttributeError: 'NoneType' object has no attribute 'get_text'

I think this is because of the <small> tag inside the <p> tag. Is there a way to handle this? Please help.

Answer 1

You can access the text value as follows:

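import bs4
# Sample markup from the question
l = '<p class="text-primary" style="color : #2793ed; font:Arial, Helvetica, sans-serif; font-size:14px; font-weight:normal">Andrew <small style="color:#999999;font-size:11px">Friday, 13 December 2019 07:58 PM </small> </p>'
# Name the parser explicitly so bs4 does not warn about guessing one
newsPageSoup = bs4.BeautifulSoup(l, 'html.parser')
# .text returns all nested text, including the date inside <small>
readerNames = newsPageSoup.find(class_='text-primary').text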
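Note that the AttributeError in the question means find() returned None, i.e. no element with class "text-primary" was found in the parsed document. A nested <small> tag does not cause that by itself. If only the name is wanted, without the date inside <small>, one possible approach is to read just the text that sits directly inside the <p>. The following is a minimal sketch under that assumption; the helper name extract_reader_name and the shortened sample markup are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (illustrative).
html = (
    '<p class="text-primary" style="color : #2793ed; font-size:14px">'
    'Andrew <small style="color:#999999; font-size:11px">'
    'Friday, 13 December 2019 07:58 PM </small> </p>'
)


def extract_reader_name(soup):
    """Return the text directly inside the first .text-primary tag, or None.

    Hypothetical helper: it guards against find() returning None and
    ignores text nested in child tags such as <small>.
    """
    tag = soup.find(class_='text-primary')
    if tag is None:
        # No matching element in the parsed document.
        return None
    # recursive=False limits the search to direct children, so the
    # string inside <small> is skipped.
    direct_text = tag.find(string=True, recursive=False)
    return direct_text.strip() if direct_text else None


reader_name = extract_reader_name(BeautifulSoup(html, 'html.parser'))
print(reader_name)
```

If this helper returns None for the real page, it may be worth checking whether the element is actually present in newsPage.content, since the page you see in a browser can differ from the HTML the request returns.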

How to extract text inside a <p> that contains a <small> tag
