Question

Using BeautifulSoup, I want to find strings that are not enclosed in any tag and wrap them in <p>, but I don't know how to do it.

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>

I want to change the HTML above into this:

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
<p>string,string,string</p><br/>       <-here
<p>string,string</p>
<p>string</p>                     <- here
<p><a href="/test" target="_blank">string</a></p>
<p>textexttext</p>               <- here
<p>stringstring</p>
<p><a href="tel:xxxxxxxx">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>

I tried the following, but it does not work:

from bs4 import BeautifulSoup as BS

html = """<p>string</p>
    <figure class="image"> <img alt="" src="sample.jpg"/> </figure>
    <p>string,string,string</p><br/> <-here
    <p>string,string</p>
    <p>string</p> <- here
    <p><a href="/test" target="_blank">string</a></p>
    <p>textexttext</p>    <- here
    <p>stringstring</p>
    <p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
    <div>textextext</div>
    <p>string,string<br/>string</p>"""

soup = BS(html, "html.parser")

while True:
    text = soup.find(text=True)
    if not text:
        break
    if not text.parent.name in ['p', 'span', 'a', 'div']:
        text.wrap(content.new_tag("p"))

Answer 1

Try the following:

from bs4 import BeautifulSoup as bs

html = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""

soup = bs(html, "html.parser")

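# string=True selects every text node (it replaces the older text= argument).
for text in soup.find_all(string=True):
    # Wrap only non-blank strings whose parent is not already a text container.
    if text.parent.name not in ['p', 'span', 'a', 'div'] and text.strip():
        text.wrap(soup.new_tag("p"))

print(soup)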

This gives you the following output:

<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure><p>
string,string,string</p><br/>
<p>string,string</p><p>
string
</p><p><a href="/test" target="_blank">string</a></p><p>
textexttext
</p><p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>
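
If the newlines that end up inside the new <p> tags are unwanted, one possible variation is to build each <p> from the stripped text and swap it in for the original string. This is only a sketch: the function name wrap_bare_strings is made up for illustration, and it assumes the bare strings sit at the top level of the document, as in the question's sample.

```python
from bs4 import BeautifulSoup, NavigableString


def wrap_bare_strings(html):
    """Wrap top-level, non-blank strings in <p>, dropping surrounding whitespace.

    Hypothetical helper; assumes bare text only appears at the top level.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Iterate over a copy, because the tree is modified inside the loop.
    for node in list(soup.contents):
        # Exact type check so comments and doctypes are left alone.
        if type(node) is NavigableString and node.strip():
            p = soup.new_tag("p")
            p.string = node.strip()
            node.replace_with(p)
    return str(soup)


if __name__ == "__main__":
    print(wrap_bare_strings("<p>a</p>\nb\n<div>c</div>"))
```

Because the whole string is replaced, the line breaks around the bare text are removed along with it; text nested inside other tags is not touched.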

Answer 2

If the bare string is always at the start of a line, you can do this:

import re
text = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""
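# Match non-empty text right after a newline, stopping at the first tag.
# Using + instead of * avoids inserting an empty <p></p> before lines that start with a tag.
r = re.compile(r"(?<=\n)([^\n<>\\]+)")
print(r.sub(r'<p>\g<1></p>', text))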
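
The lookbehind on "\n" means bare text on the very first line is never matched. A variant of the same idea, sketched here with re.MULTILINE so that the first line is covered and indented tag lines are skipped, could look like this. The names LINE_TEXT and wrap_line_leading_text are made up for illustration, and the approach still assumes that bare text only ever starts a line.

```python
import re

# Group 1: optional indentation, kept outside the new <p>.
# Group 2: text starting with a non-space, non-bracket character, up to the first tag or line end.
LINE_TEXT = re.compile(r"^([ \t]*)([^\s<>][^\n<>]*)", re.MULTILINE)


def wrap_line_leading_text(text):
    """Wrap text found at the start of a line in <p>...</p> (hypothetical helper)."""
    return LINE_TEXT.sub(r"\1<p>\2</p>", text)


if __name__ == "__main__":
    sample = "first\n<p>a</p>\nb,c<br/>\n  <div>d</div>"
    print(wrap_line_leading_text(sample))
```

Like any regex-based approach, this does not parse the HTML, so text that follows a tag on the same line, or a tag body that spans several lines, is not handled correctly.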

How to find strings that are not enclosed in tags
