I have the following HTML:
html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
East Counties</h2>
<h2>
East Midlands</h2>
<h2>
London</h2>
<h2>
North East</h2>
<h2>
North West</h2>
<h2>
South East</h2>
<h2>
South West</h2>
<h2>
West Midlands</h2>
<h2>
Yorkshire and Humberside</h2>
<h2>
Northern Ireland</h2>
<h2>
Scotland</h2>
<h2>
Wales</h2>
"""
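How can I skip the first four lines and get at the text strings such as East Counties?
My attempt does not skip the first four lines, and the strings it returns include a lot of whitespace embedded in the markup (which I want to remove):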
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2'):
    next
    next
    next
    next
    print (str(h2.children.next()))
Desired result:
East Counties
East Midlands
London
North East
...
What am I doing wrong?
Answer 0 (score: 4)
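You can use slicing here: find_all returns a list-like result, so you can index it, for example with [4:] to drop the first four headings. To get rid of the whitespace, call strip() on each heading's text.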
Answer 1 (score: 2)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2')[4:]:  # slicing to skip the first 4 elements
    print(h2.text.strip())  # get the inner text of the tag, then strip the whitespace
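As for what went wrong in the original attempt: a bare `next` on its own line only names the built-in function and never calls it, so it does not skip anything, and the loop still visits every heading. (On Python 3, iterators also have no `.next()` method; the built-in `next(iterator)` is used instead.) Below is a minimal sketch of both points using only the standard library. The list `heading_texts` is hypothetical sample data standing in for the text of each `<h2>` tag, not real BeautifulSoup output.

```python
# Plain strings stand in for the text of each <h2> tag (hypothetical sample data).
heading_texts = [
    " API guidance for developers",
    "Images",
    "Score descriptors",
    "Downloadable XML data files (updated daily)",
    "\nEast Counties",
    "\nEast Midlands",
    "\nLondon",
]

# A bare `next` only references the built-in function; it does not advance the loop.
visited = []
for text in heading_texts:
    next  # no-op expression statement, nothing is skipped
    visited.append(text)

# Slicing drops the first four items; strip() removes the surrounding whitespace.
regions = [text.strip() for text in heading_texts[4:]]

print(len(visited))  # every item was still visited
print(regions)
```

The same slice-then-strip pattern is what the answer above applies to the result of `find_all('h2')`.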