Question

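I'm using BeautifulSoup to scrape text from a website, and I only want the <p> tags so the output stays organized. However, I can't use text.findAll('p'), because the page has other <p> tags that I don't want.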

The text I want is all contained inside a single tag (let's say body), but when I parse it, the result includes that tag as well.

import requests
import bs4
link = requests.get('link')
text = bs4.BeautifulSoup(link.text, 'html.parser').find('body')

How do I remove the body tag?

Answer 1

text = bs4.BeautifulSoup(link.text, 'html.parser').find('body').text

This concatenates all the text inside the body tag.
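To make the behaviour of .text concrete, here is a minimal sketch. The sample markup is made up for illustration and stands in for the downloaded page; get_text() is the method form of .text and accepts a separator and a strip flag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
html = "<html><body><p>First</p><p>Second <b>bold</b></p></body></html>"

body = BeautifulSoup(html, "html.parser").find("body")

# .text joins every string found inside <body>; the tags themselves are dropped
joined = body.text

# get_text() does the same, but can put a separator between the pieces
# and strip surrounding whitespace from each piece
lines = body.get_text(separator="\n", strip=True)

print(joined)
print(lines)
```

Note that the separator is inserted between every string, including the ones that come from nested inline tags such as <b>, so it may split a paragraph across lines.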

Answer 2

This may help you:

>>> from bs4 import BeautifulSoup
>>> txt = """\
<p>Rahul</p>
<p><i>White</i></p>
<p>City <b>Beston</b></p>
"""

>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Rahul
White
City Beston

Or you can do it like this:

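soup = BeautifulSoup(html, "html.parser")
bodyTag = soup.find('body')
# decode_contents() returns the inner HTML as a string, without the <body> tag itself
bodyText = BeautifulSoup(bodyTag.decode_contents(), "html.parser")
print("".join(bodyText.strings))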
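Since the question is about keeping only the <p> tags inside one container, another option is to call find_all on the container rather than on the whole document. This is only a sketch: the markup and the id values "sidebar" and "content" are hypothetical, so substitute whatever tag actually wraps the text you want.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one <p> outside the container we care about, two inside
html = """
<div id="sidebar"><p>Unwanted</p></div>
<div id="content"><p>Rahul</p><p>City <b>Beston</b></p></div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="content")

# Searching from the container only looks at its descendants,
# so the <p> in the sidebar is never visited
paragraphs = [p.get_text() for p in container.find_all("p")]

# decode_contents() gives the inner markup without the wrapping tag
inner_html = container.decode_contents()

print(paragraphs)
print(inner_html)
```

The list keeps one entry per paragraph, which preserves the organization that a single joined string loses.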

Extract everything inside a tag, but not the tag itself
