Python: skip lines and strip whitespace when parsing HTML code

Time: 2017-05-11 09:22:27

Tags: python html string beautifulsoup html-parsing

I have the following HTML code:

html_doc = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
<h2>
                                    London</h2>
<h2>
                                    North East</h2>
<h2>
                                    North West</h2>
<h2>
                                    South East</h2>
<h2>
                                    South West</h2>
<h2>
                                    West Midlands</h2>
<h2>
                                    Yorkshire and Humberside</h2>
<h2>
                                    Northern Ireland</h2>
<h2>
                                    Scotland</h2>
<h2>
                                    Wales</h2>
"""

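How can I skip the first four lines and access the text strings such as East Counties?

My attempt does not skip the first four lines, and the strings it returns include the many whitespace characters embedded in the code (which I want to remove):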

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for h2 in soup.find_all('h2'):
    next
    next
    next
    next
    print (str(h2.children.next()))

Desired result:

East Counties
East Midlands
London
North East
...

What am I doing wrong?

2 answers:

Answer 0 (score: 4)

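You can use slicing here: find_all returns a list-like result (a ResultSet), so you can index it, e.g. with [4:], to skip the first four elements. To remove the surrounding whitespace, call strip() on each string.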

Answer 1 (score: 2)

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

for h2 in soup.find_all('h2')[4:]: # slicing to skip the first 4 elements
    print(h2.text.strip()) # get the inner text of the tag and then strip the white space
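
As for what went wrong in the original attempt: a bare `next` inside the loop body is only an expression that names the built-in function, so it does nothing (the keyword for skipping an iteration is `continue`), and `h2.children.next()` is Python 2 only; in Python 3 it would be written `next(h2.children)`. Below is a minimal sketch of the same slicing idea using `get_text(strip=True)` instead of `.text.strip()`. The names `sample_html` and `region_names` are made up for illustration, and `sample_html` is a shortened stand-in for the question's `html_doc`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's html_doc (assumption: same shape, fewer rows).
sample_html = """
<h2> API guidance for developers</h2>
<h2>Images</h2>
<h2>Score descriptors</h2>
<h2>Downloadable XML data files (updated daily)</h2>
<h2>
                                    East Counties</h2>
<h2>
                                    East Midlands</h2>
"""

soup = BeautifulSoup(sample_html, 'html.parser')


def region_names(soup, skip=4):
    """Return the stripped text of every <h2> after the first `skip` ones.

    Hypothetical helper, not part of BeautifulSoup.
    """
    # find_all returns a list-like ResultSet, so ordinary slicing works on it.
    # get_text(strip=True) strips whitespace from each text fragment in the tag.
    return [h2.get_text(strip=True) for h2 in soup.find_all('h2')[skip:]]


names = region_names(soup)
for name in names:
    print(name)
```

With the shortened sample this prints East Counties and East Midlands, one per line. Making `skip` a parameter is just a convenience in case the number of leading headings differs on another page.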