Question

I have the following HTML code:

html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
<h2>
                                    London</h2>
<h2>
                                    North East</h2>
<h2>
                                    North West</h2>
<h2>
                                    South East</h2>
<h2>
                                    South West</h2>
<h2>
                                    West Midlands</h2>
<h2>
                                    Yorkshire and Humberside</h2>
<h2>
                                    Northern Ireland</h2>
<h2>
                                    Scotland</h2>
<h2>
                                    Wales</h2>
"""

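How can I skip the first four lines and access the text strings such as East Counties?

My attempt does not skip the first four lines, and the strings it returns include the many spaces embedded in the code (which I want to remove):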

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2'):
    next
    next
    next
    next
    print (str(h2.children.next()))

Desired result:

East Counties
East Midlands
London
North East
...

What am I doing wrong?

Answer 1

You can use slicing here: find_all returns a list-like result, so you can index it, for example with [4:], and you can remove the surrounding whitespace with strip().

Answer 2

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

for h2 in soup.find_all('h2')[4:]: # slicing to skip the first 4 elements
    print(h2.text.strip()) # get the inner text of the tag and then strip the white space
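
As for what went wrong in the original attempt: a bare next inside the loop body is just an expression that names the built-in function, so it does not skip anything, and in Python 3 iterators have no .next() method. Below is a minimal, self-contained sketch of the slicing approach that collects the names into a list instead of printing them directly. The shortened sample_html and the helper name region_names are made up for illustration; get_text(strip=True) is used here as an alternative to .text.strip().

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the question's html_doc.
sample_html = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
<h2>
                                    London</h2>
"""


def region_names(html, skip=4):
    """Return the stripped text of every <h2> after the first `skip` headings."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like ResultSet, so ordinary slicing works.
    headings = soup.find_all("h2")[skip:]
    # get_text(strip=True) removes leading/trailing whitespace from the text.
    return [h2.get_text(strip=True) for h2 in headings]


names = region_names(sample_html)
for name in names:
    print(name)
```

Note that skipping by position assumes the first four headings are always the ones to ignore; if the page layout changes, the skip count would need adjusting.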

Python: skipping lines and stripping whitespace when parsing HTML code
