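How to find strings that are not enclosed in a tag

Time: 2017-05-26 13:33:17

Tags: python html beautifulsoup

Using BeautifulSoup, I want to find strings that are not surrounded by any tag and wrap them in <p>, but I don't know how to do it.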

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>

I want to turn the HTML above into the following:

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
<p>string,string,string</p><br/>       <-here
<p>string,string</p>
<p>string</p>                     <- here
<p><a href="/test" target="_blank">string</a></p>
<p>textexttext</p>               <- here
<p>stringstring</p>
<p><a href="tel:xxxxxxxx">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>

I tried the following, but it does not work:

from bs4 import BeautifulSoup as BS

html = """<p>string</p>
    <figure class="image"> <img alt="" src="sample.jpg"/> </figure>
    <p>string,string,string</p><br/> <-here
    <p>string,string</p>
    <p>string</p> <- here
    <p><a href="/test" target="_blank">string</a></p>
    <p>textexttext</p>    <- here
    <p>stringstring</p>
    <p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
    <div>textextext</div>
    <p>string,string<br/>string</p>"""

soup = BS(html, "html.parser")

while True:
    text = soup.find(text=True)
    if not text:
        break
    if not text.parent.name in ['p', 'span', 'a', 'div']:
        text.wrap(content.new_tag("p"))

2 answers:

Answer 0: (score: 1)

Try the following:

from bs4 import BeautifulSoup as bs

html = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""

soup = bs(html, "html.parser")

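# find_all returns a list, so wrapping nodes while looping over it is safe.
# string= is the current name of the older text= argument (bs4 4.4.0+).
for text in soup.find_all(string=True):
    # Wrap only non-blank text whose parent is not one of these tags.
    if text.parent.name not in ['p', 'span', 'a', 'div'] and text.strip():
        text.wrap(soup.new_tag("p"))

print(soup)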

This gives you the following output:

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure><p>
string,string,string</p><br/>
<p>string,string</p><p>
string
</p><p><a href="/test" target="_blank">string</a></p><p>
textexttext
</p><p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>    
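
The output above still carries the newlines that surrounded each loose string inside the new <p> tags. If that matters, one option is to strip the text before wrapping it. The following is a minimal sketch, not a definitive implementation: it assumes the loose strings are direct children of the document (no enclosing tag at all), it uses a shortened sample, and the names `node`, `stripped` and `result` are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened sample (an assumption made for this sketch, not the full input).
html = """<p>string</p>
string,string,string<br/>
<div>textextext</div>
textexttext
<p>stringstring</p>"""

soup = BeautifulSoup(html, "html.parser")

# Copy the children into a list first, because the loop modifies the tree.
for node in list(soup.children):
    # Top-level text nodes have no enclosing tag; skip whitespace-only ones.
    if isinstance(node, NavigableString) and node.strip():
        # Swap the node for a stripped copy, then wrap that copy in <p>.
        stripped = NavigableString(node.strip())
        node.replace_with(stripped)
        stripped.wrap(soup.new_tag("p"))

result = str(soup)
print(result)
```

With this sample the loose strings should come out as `<p>string,string,string</p>` and `<p>textexttext</p>`, while text that already sits inside <p> or <div> is left alone. Checking direct children instead of a list of parent tag names is a design choice of this sketch; it will not wrap text nested inside other tags such as <figure>.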

Answer 1: (score: 0)

If the string is always at the start of a line, you can do this:

import re
text = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""
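# Match a non-empty run of text at the start of a line (right after a newline).
# "+" rather than "*" avoids zero-width matches that would insert empty <p></p> pairs.
r = re.compile(r"(?<=\n)([^\n<>\\]+)")
# A raw string keeps \g<1> from being treated as a string escape.
print(r.sub(r'<p>\g<1></p>', text))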
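
One thing to watch with this regex approach is the quantifier: `*` can match the empty string at the start of a line that begins with a tag, whereas `+` requires at least one character of text. The small sketch below compares the two on a hypothetical sample; the names `sample`, `star` and `plus` are made up for illustration.

```python
import re

# Hypothetical sample: two loose strings and one line that starts with a tag.
sample = "<p>a</p>\nloose text<br/>\n<div>b</div>\nmore text"

# Same pattern twice, differing only in the quantifier.
star = re.compile(r"(?<=\n)([^\n<>\\]*)")
plus = re.compile(r"(?<=\n)([^\n<>\\]+)")

with_star = star.sub(r"<p>\g<1></p>", sample)
with_plus = plus.sub(r"<p>\g<1></p>", sample)

print(with_star)  # contains an empty <p></p> in front of <div>
print(with_plus)  # wraps only the two loose strings
```

Either way, this only works when the loose text starts a line and contains no `<`, `>` or backslash, so it is best treated as a quick fix for input of a known shape rather than a general HTML solution.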