Question

I need to get the raw text of an HTML page, but only the text that comes after the h1 heading.

I can get the h1 inside the body like this:

from bs4 import BeautifulSoup

# content holds the raw bytes of the page
soup = BeautifulSoup(content.decode('utf-8','ignore'), 'html.parser')
extracted_h1 = soup.body.h1

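My idea is to get all the elements and compare each one with the h1 I extracted above. Every element that comes after the h1 would be appended to a separate list, and then I would call getText() on each element saved in that list.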

# find all html elements
found = soup.findAll() # text=True
fill_element = list()
for element in found:
    # something like this, but it doesn't work
    if element == extracted_h1:
       # after this start appending the elements to fill_element list

But this doesn't work. Any ideas on how to achieve this?

Answer 1

Assuming you are using BeautifulSoup 4.4 or later, you can use the find_all_next method:

soup.body.h1.find_all_next(string=True)

This returns every text node that follows the h1 tag in document order; the first one is the text of the h1 itself.
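Since the first strings returned belong to the h1 itself, they usually need to be skipped. The following is a minimal sketch of one way to do that, not a definitive implementation; the sample HTML and the variable names (html, strings, after_h1, text) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration
html = (
    "<html><body><p>intro</p><h1>Title</h1>"
    "<p>First <b>bold</b> para.</p><div>Second</div></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
h1 = soup.body.h1

# Every text node from the h1 onwards, in document order
strings = h1.find_all_next(string=True)

# Keep only the strings that are not nested inside the h1 itself
after_h1 = [s for s in strings if not any(p is h1 for p in s.parents)]

text = "".join(after_h1)
print(text)
```

The identity check against s.parents also handles an h1 that contains nested tags (for example a link), where the heading text spans more than one string.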

Answer 2

Why not call find_all_next on the h1 tag and collect the text nodes?

Example:

>>> import bs4
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p><!-- END -->
... <p class="story">...</p>
... """
...
>>> soup = bs4.BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.text)
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

>>> print(''.join(soup.find_all('p')[1].find_all_next(string=True)))

Once upon a time there were three little sisters; and their names were
Elsie,
 STARTLacie and
Tillie;
and they lived at the bottom of a well. END 
...
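Note that the START and END comments show up in the output above: HTML comments are parsed as Comment objects, which are a kind of string node, so a string filter matches them as well. If they are unwanted, they can be filtered out. This is a small sketch under that assumption; the sample markup and the names html_sample, visible and visible_text are illustrative only.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup containing an HTML comment
html_sample = "<body><h1>Title</h1><p>Kept <!-- hidden -->text</p></body>"

soup = BeautifulSoup(html_sample, "html.parser")

# All string nodes from the h1 onwards, excluding HTML comments
visible = [
    s for s in soup.h1.find_all_next(string=True)
    if not isinstance(s, Comment)
]

visible_text = "".join(visible)
print(visible_text)
```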

Get the text after an h1 using Beautiful Soup in Python
