Question

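I want to grab the text that comes after the Description heading and before the next heading.

I know how to find the heading itself: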

In [8]: soup.findAll('h2')[6]
Out[8]: <h2>Description</h2>

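However, I don't know how to grab the actual text. The problem is that I have to do this for multiple links. Some of the pages wrap the text in p tags: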

<h2>Description</h2>
<p>This is the text I want </p>
<p>This is the text I want</p>
<h2>Next header</h2>

However, some do not:

<h2>Description</h2>
    This is the text I want
<h2>Next header</h2>

Also, every page contains other p tags, so I can't just use soup.findAll('p')[22], because on some pages the 'p' I need is at index 21 or 20.

Answer 1

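Walk the siblings that follow the heading. Use isinstance with NavigableString to check whether the next sibling is a text node, and with Tag to check whether it is an element.

If the next sibling is a heading, break out of the loop.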

from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h2'):
    next_node = header
    while True:
        # Step to the following sibling; stop when there are none left.
        next_node = next_node.next_sibling
        if next_node is None:
            break
        if isinstance(next_node, NavigableString):
            # Bare text between tags (the case without <p>).
            print(next_node.strip())
        if isinstance(next_node, Tag):
            # Stop at the next heading.
            if next_node.name == "h2":
                break
            print(next_node.get_text(strip=True))
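
The loop above prints the text after every h2. If you only need the Description section, the same sibling walk can be wrapped in a small helper. This is just a sketch: the function name text_after_heading and its parameters are made up for illustration, and it assumes the text you want sits at the same nesting level as the heading (that is, the nodes are siblings of the h2, as in both of your examples).

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_after_heading(html, heading_text, heading_tag="h2"):
    """Collect the text between one named heading and the next heading.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Locate the heading by its visible text instead of a fixed index.
    header = soup.find(
        lambda tag: tag.name == heading_tag
        and tag.get_text(strip=True) == heading_text
    )
    if header is None:
        return []

    chunks = []
    for node in header.next_siblings:
        if isinstance(node, Tag):
            if node.name == heading_tag:
                break  # reached the next heading, stop collecting
            text = node.get_text(strip=True)
        elif isinstance(node, NavigableString):
            text = node.strip()  # bare text that is not wrapped in <p>
        else:
            continue
        if text:  # skip whitespace-only nodes
            chunks.append(text)
    return chunks


with_p = (
    "<h2>Description</h2><p>This is the text I want </p>"
    "<p>This is the text I want</p><h2>Next header</h2>"
)
without_p = "<h2>Description</h2>\n   This is the text I want   \n<h2>Next header</h2>"

print(text_after_heading(with_p, "Description"))
print(text_after_heading(without_p, "Description"))
```

Matching on the heading text avoids depending on positions such as [6] or [22], which is what differs from page to page in your case.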
