我正在研究一种可以在html标签内翻译文本的转换器,并且我正在使用beautifulsoup,因为它是python中最好的html解析器之一。
这是文本,然后将其加载到汤中
In [95]: chalet.html
Out[95]: '<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>\r\n\r\n<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'
In [96]: html = soup(chalet.html)
In [97]: print(chalet.html)
<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
接下来将其分解为内容,以便我可以对其进行解析
In [105]: html.contents
Out[105]:
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]
介于这一切之间的是新行,我可以用try and catch块忽略它们,但是获取字符串似乎也只能在某些行上起作用
In [107]: contents[0]
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
In [108]: contents[0].string
Out[108]: '“Create a space I would be truly excited to stay in”.'
In [109]: contents[1]
Out[109]: '\n'
In [110]: contents[1].string
Out[110]: '\n'
In [111]: contents[2]
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
In [112]: contents[2].string
如果您知道如何以不剥离标签的方式提取这些部分,那么replace
将适用于主字符串。
答案 0 :(得分:1)
使用.stripped_strings
属性从HTML中获取清晰的文本。
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings
from bs4 import BeautifulSoup
from pprint import pprint
html = '''
<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
'''
soup = BeautifulSoup(html, 'html.parser')
texts = [*soup.stripped_strings]
pprint(texts)
输出:
['“Create a space I would be truly excited to stay in”.',
'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
'stream and an alpine woodland. The result was Chalet',
'Belle Chéry.',
'Belle Chéry is a chalet built without constraint. A destination, to be '
...
获得一个长字符串:
long_string = ' '.join(texts)
输出:
“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
答案 1 :(得分:0)
您可以使用列表组合和str.join
来加入内容列表,而无需使用换行符以获取所需的输出:
contents = ''.join([data for data in html.contents if data != '\n'])
现在,您可以制作汤了:
soup = BeautifulSoup(contents, 'lxml')
用您喜欢的解析器替换lxml
。