Replace <br/> with a space in BeautifulSoup output

Time: 2019-04-09 09:59:55

Tags: python web-scraping beautifulsoup

I'm scraping some links with BeautifulSoup, but it seems to ignore <br> tags completely.

Here is the relevant part of the source of the URL I'm scraping:

    <h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span id="something">&#xe800;</span></h1>

Here is my BeautifulSoup code (relevant part only) that gets the text inside the h1 tag:

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.text.strip()
    print(title)

This gives the following output:

    A quick brown fox jumps overthe lazy dog

I want this instead:

    A quick brown fox jumps over the lazy dog

How can I replace <br> with a space in my code?

3 answers:

Answer 0 (score: 3)

How about using .get_text() with the separator argument?

    from bs4 import BeautifulSoup

    page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span>some stuff here</span></h1>'''

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.get_text(separator=" ").strip()
    print(title)

Output:

    A quick brown fox jumps over the lazy dog
     some stuff here
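If the line break and the leading space before "some stuff here" are unwanted, one option is to pass `strip=True` alongside the separator. The following is a minimal sketch reusing the sample HTML from this answer; it assumes a single-line result is what you are after.

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# strip=True strips every text fragment and skips whitespace-only ones
# before they are joined with the separator.
title = title_box.get_text(separator=" ", strip=True)
print(title)
```

With this sample input, the three text fragments end up on one line separated by single spaces.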

Answer 1 (score: 2)

Use replace() on the HTML before parsing:

    from bs4 import BeautifulSoup

    html = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span>some stuff here</span></h1>'''

    html = html.replace("<br>", " ")
    soup = BeautifulSoup(html, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.get_text().strip()
    print(title)

Output:

    A quick brown fox jumps over the lazy dog
    some stuff here
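A plain string replace only matches the exact spelling `<br>`, so variants such as `<br/>` or `<br />` would slip through. As a possible alternative, the tags can be swapped out after parsing with `replace_with()`. This is a sketch under the assumption that every `<br>` inside the heading should become one space; the sample HTML deliberately uses `<br/>` to show the variant being handled.

```python
from bs4 import BeautifulSoup

html = '''<h1 class="para-title">A quick brown fox jumps over<br/>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# find_all returns a list, so modifying the tree inside the loop is safe.
for br in title_box.find_all('br'):
    br.replace_with(' ')  # swap the <br> element for a single space

title = title_box.get_text().strip()
print(title)
```

Because the parser has already normalized the tag, the loop does not care how the `<br>` was written in the source.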

Edit

For the part the OP mentioned in the comments below:

    from bs4 import BeautifulSoup

    html = '''<div class="description">Planet Nine was initially proposed to explain the clustering of orbits
    Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
    </div>'''

    html = html.replace("\n", ". ")
    soup = BeautifulSoup(html, 'html.parser')
    div_box = soup.find('div', attrs={'class': 'description'})
    divText = div_box.get_text().strip()
    print(divText)

Output:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four..
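Note the doubled period at the end of that output: the last line already ends with a full stop and is followed by a newline. A possible way around it is to join the extracted lines yourself. The sketch below uses made-up sample text (not the OP's page) and assumes every line should end with exactly one period.

```python
from bs4 import BeautifulSoup

# Hypothetical sample text: one line without a full stop, one with.
html = '''<div class="description">First sentence without a full stop
Second sentence that already ends with one.
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div_box = soup.find('div', attrs={'class': 'description'})

# Work on the extracted text line by line, skipping blank lines.
lines = [line.strip() for line in div_box.get_text().splitlines() if line.strip()]

# Drop any trailing period from each line, then add exactly one back.
div_text = '. '.join(line.rstrip('.') for line in lines) + '.'
print(div_text)
```

Working on the extracted text rather than the raw HTML also avoids touching newlines that sit inside the markup.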

Answer 2 (score: 0)

Use the str.replace function:

    print(title.replace("<br>", " "))
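One thing worth checking with this approach is where the replace is applied. In the question, `title` comes from `.text`, which has already dropped the `<br>` tag, so the string no longer contains anything to replace. The sketch below uses a shortened version of the question's HTML to compare replacing after extraction with replacing in the raw HTML; it assumes `page` is a plain string.

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog</h1>'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# .text has already discarded the <br> tag, so this replace finds nothing.
title = title_box.text.strip()
after_extraction = title.replace("<br>", " ")
print(after_extraction)

# Replacing in the raw HTML string before parsing does take effect.
fixed_soup = BeautifulSoup(page.replace("<br>", " "), 'html.parser')
before_parsing = fixed_soup.find('h1', attrs={'class': 'para-title'}).text.strip()
print(before_parsing)
```

The first print still shows "overthe" run together, while the second has the space.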