BeautifulSoup: getting the inner content of HTML tags

Date: 2019-07-17 11:00:10

Tags: python beautifulsoup

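I'm working on a translator that translates the text inside HTML tags, and I'm using BeautifulSoup because it is one of the best HTML parsers available for Python.

Here is the text, which I then load into a soup: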

In [95]: chalet.html                                                                                                                                                                       
Out[95]: '<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>\r\n\r\n<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'

In [96]: html = soup(chalet.html)                                                                                                                                                          

In [97]: print(chalet.html)                                                                                                                                                                
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>

<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>

<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>

<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>

<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>

Next I break it down into its contents so that I can parse it:

In [105]: html.contents                                                                                                                                                                    
Out[105]: 
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]

Between the elements there are newlines, which I can skip with a try/except block, but getting the string also seems to work on only some of the elements:

In [107]: contents[0]                                                                                                                                                                      
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>

In [108]: contents[0].string                                                                                                                                                               
Out[108]: '“Create a space I would be truly excited to stay in”.'

In [109]: contents[1]                                                                                                                                                                      
Out[109]: '\n'

In [110]: contents[1].string                                                                                                                                                               
Out[110]: '\n'

In [111]: contents[2]                                                                                                                                                                      
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>

In [112]: contents[2].string    

If you know how I can extract these parts without stripping the tags, then replace would work on the main string.

2 answers:

Answer 0 (score: 1)

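Use the .stripped_strings generator to get clean text out of the HTML. It yields every text node with leading and trailing whitespace removed and skips nodes that are whitespace only.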

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings

from bs4 import BeautifulSoup
from pprint import pprint

html = '''
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>
<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
'''
soup = BeautifulSoup(html, 'html.parser')
texts = [*soup.stripped_strings]
pprint(texts)

Output:

['“Create a space I would be truly excited to stay in”.',
 'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
 'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
 'stream and an alpine woodland. The result was Chalet',
 'Belle Chéry.',
 'Belle Chéry is a chalet built without constraint. A destination, to be '
...
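
As background on why `contents[2].string` printed nothing in the question: as the Beautiful Soup documentation describes it, `.string` is only defined when a tag has a single child (or a single chain of only children) and is `None` otherwise. The second `<h4>` holds two `<strong>` tags separated by a space, so it falls into the `None` case. A minimal sketch on a shortened stand-in for the question's markup (the `html` sample and variable names below are assumptions for illustration, not the full text):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (assumption).
html = (
    "<h4><strong>Create a space.</strong></h4>\n"
    "<h4><strong>The result was Chalet</strong> "
    "<strong>Belle Ch&eacute;ry.</strong></h4>"
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("h4")

# h4 > strong > text is a single chain, so .string resolves to the text.
single = first.string

# Three children (strong, " ", strong): .string is ambiguous and is None.
ambiguous = second.string

# .stripped_strings yields each text node, skipping whitespace-only ones.
pieces = list(second.stripped_strings)

# get_text() concatenates all text nodes, keeping the space between tags.
joined = second.get_text()

print(single)
print(ambiguous)
print(pieces)
print(joined)
```

So `.string` is convenient for leaf-like tags, while `.stripped_strings` and `get_text()` work for tags with mixed children.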

To get one long string:

long_string = ' '.join(texts)

Output:

“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
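
If the goal is to translate the text while keeping the tags, one option is to replace each text node in place instead of extracting the strings. The sketch below is only an illustration under assumptions: `translate` is a hypothetical stand-in (it just upper-cases the text) for whatever translation call you actually use, and `translate_html` and `sample` are names made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def translate(text):
    """Hypothetical stand-in for a real translation call."""
    return text.upper()


def translate_html(html):
    soup = BeautifulSoup(html, "html.parser")
    # Snapshot the text nodes first, because the tree is modified in the loop.
    for node in list(soup.find_all(string=True)):
        if node.strip():  # skip whitespace-only nodes such as "\n"
            node.replace_with(NavigableString(translate(node)))
    return str(soup)


sample = (
    "<p>The chalet is 680m<sup>2</sup>, with <strong>views</strong>.</p>\n"
    "<p>Second.</p>"
)
result = translate_html(sample)
print(result)
```

The markup comes back unchanged apart from the text. One caveat: each text node is translated on its own, so a sentence that is split by inline tags such as `<strong>` or `<sup>` reaches the translator in fragments.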

Answer 1 (score: 0)

You can use a list comprehension and str.join to join the contents list without the newlines and get the desired output:

# html.contents holds Tag objects, so convert each one to str before joining
contents = ''.join([str(data) for data in html.contents if data != '\n'])

Now you can make the soup:

soup = BeautifulSoup(contents, 'lxml')

Replace lxml with the parser of your choice.
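
A related way to skip the newlines without comparing against '\n' is to ask only for the top-level tags and serialize them, either with the outer tag (`str(tag)`) or without it (`tag.decode_contents()`). This is a sketch on a shortened sample, not the full text from the question; the `html`, `outer` and `inner` names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (assumption).
html = (
    "<h4><strong>The result was Chalet</strong> "
    "<strong>Belle Ch&eacute;ry.</strong></h4>\n"
    "<p>Built without constraint.</p>"
)
soup = BeautifulSoup(html, "html.parser")

# recursive=False limits the search to direct children; find_all only
# returns tags, so the "\n" strings between the blocks are left out.
blocks = soup.find_all(True, recursive=False)

# Markup of each block including its own tag.
outer = [str(block) for block in blocks]

# Markup inside each block, with nested tags such as <strong> kept.
inner = [block.decode_contents() for block in blocks]

print(outer)
print(inner)
```

`''.join(outer)` then gives the same kind of newline-free string as the join shown in this answer.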