I need to get the raw text of an HTML page, but only the text that comes after the h1 heading.
I can get the h1 of the body like this:
soup = BeautifulSoup(content.decode('utf-8','ignore'), 'html.parser')
extracted_h1 = soup.body.h1
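My idea is to fetch all elements and compare each one with the h1 extracted above. Every element after the h1 would be appended to a separate list, and I would then call getText() on each saved element.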
# find all HTML elements
found = soup.find_all()  # or string=True for text nodes only
fill_element = []
for element in found:
    # something like this, but it doesn't work
    if element == extracted_h1:
        # after this, start appending the elements to the fill_element list
        pass
But this doesn't work. Any ideas on how to achieve this?
Answer 0 (score: 1)
Assuming you are using BeautifulSoup 4.4 or later, this method is available:
soup.body.h1.find_all_next(string=True)
This returns every text node after the first h1; the first result is the text of the h1 itself.
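Since the h1's own text comes back first, here is a minimal sketch of one way to skip it and keep only what follows. The sample markup and the helper name text_after are hypothetical and not part of the original question; the sketch assumes the 'html.parser' backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, not taken from the question.
html_doc = """
<html><body>
<p>Intro before the heading.</p>
<h1>Main <em>title</em></h1>
<p>First paragraph.</p>
<div><p>Nested <b>bold</b> text.</p></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
h1 = soup.body.h1


def text_after(tag):
    """Return stripped text nodes that follow `tag`, skipping text inside `tag`."""
    parts = []
    for s in tag.find_all_next(string=True):
        # Skip strings nested inside the tag itself (identity check, not ==).
        if any(parent is tag for parent in s.parents):
            continue
        stripped = s.strip()
        if stripped:  # drop whitespace-only nodes between tags
            parts.append(stripped)
    return parts


print(" ".join(text_after(h1)))
```

The identity check (`is`) is used instead of `==` because bs4 compares tags by name, attributes and contents, so two identical-looking headings would otherwise be treated as the same element.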
Answer 1 (score: 1)
Why not try find_all_next on the h1 tag and collect the text?
Example:
>>> import bs4
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p><!-- END -->
... <p class="story">...</p>
... """
...
>>> soup = bs4.BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.text)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
>>> print(''.join(soup.find_all('p')[1].find_all_next(text=True)))
Once upon a time there were three little sisters; and their names were
Elsie,
STARTLacie and
Tillie;
and they lived at the bottom of a well. END
...
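Note that the output above includes START and END, which are the contents of HTML comments: bs4 represents comments as a subclass of its string type, so a string search returns them as well. If they are unwanted, a minimal sketch of filtering them out could look like this (the sample markup is hypothetical):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page with comments mixed into the content.
html_doc = """
<body>
<h1>Title</h1>
<p>Kept text.<!-- hidden note --></p>
<!-- another comment -->
<p>More text.</p>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# All strings from the h1 onward, including the h1's own text and any comments.
strings = soup.h1.find_all_next(string=True)

# Keep only non-comment, non-blank strings.
visible = [
    s.strip()
    for s in strings
    if not isinstance(s, Comment) and s.strip()
]

print(visible)
```

As in the first answer, the first entry is still the h1's own text.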