Using BeautifulSoup, I want to find strings that are not enclosed in any tag and wrap them in <p>, but I don't know how to do it.
<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>
I want to change the HTML above into this:
<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
<p>string,string,string</p><br/> <-here
<p>string,string</p>
<p>string</p> <- here
<p><a href="/test" target="_blank">string</a></p>
<p>textexttext</p> <- here
<p>stringstring</p>
<p><a href="tel:xxxxxxxx">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>
I tried the following, but it does not work properly:
from bs4 import BeautifulSoup as BS
html = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
<p>string,string,string</p><br/> <-here
<p>string,string</p>
<p>string</p> <- here
<p><a href="/test" target="_blank">string</a></p>
<p>textexttext</p> <- here
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""
soup = BS(html, "html.parser")
while True:
    text = soup.find(text=True)
    if not text:
        break
    if not text.parent.name in ['p', 'span', 'a', 'div']:
        text.wrap(content.new_tag("p"))
Answer 0 (score: 1)
Try the following:
from bs4 import BeautifulSoup as bs
html = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""
soup = bs(html, "html.parser")
for text in soup.find_all(text=True):
    # Wrap only non-blank strings whose parent is not one of these tags
    if text.parent.name not in ['p', 'span', 'a', 'div'] and text.strip():
        text.wrap(soup.new_tag("p"))
print(soup)
This gives you the following output:
<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure><p>
string,string,string</p><br/>
<p>string,string</p><p>
string
</p><p><a href="/test" target="_blank">string</a></p><p>
textexttext
</p><p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>
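The output above keeps the surrounding newlines inside each new <p>. If that matters, one possible variation is to walk only the top-level nodes and replace each bare string with a <p> built from its stripped text. This is a minimal sketch, not the method from the answer: the function name wrap_bare_strings is hypothetical, and it assumes the bare strings sit at the top level of the fragment rather than inside some other tag.

```python
from bs4 import BeautifulSoup, NavigableString


def wrap_bare_strings(html):
    """Wrap top-level bare text in <p>, dropping surrounding whitespace.

    Hypothetical helper for illustration; assumes bare strings are
    direct children of the parsed fragment.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Copy the child list first, because the loop modifies the tree.
    for node in list(soup.contents):
        # Exact type check so comments and doctypes are left alone.
        if type(node) is not NavigableString:
            continue
        stripped = node.strip()
        if not stripped:
            continue  # whitespace between tags stays as it is
        p = soup.new_tag("p")
        p.string = stripped
        node.replace_with(p)
    return str(soup)


if __name__ == "__main__":
    print(wrap_bare_strings("<p>a</p>\nbare text<br/>\n<div>kept</div>"))
```

Because only soup.contents is inspected, text inside <figure>, <div>, or any other tag is never touched, so no list of allowed parent names is needed.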
Answer 1 (score: 0)
If the string is always at the start of a line, you can do this:
import re
text = """<p>string</p>
<figure class="image"> <img alt="" src="sample.jpg"/> </figure>
string,string,string<br/>
<p>string,string</p>
string
<p><a href="/test" target="_blank">string</a></p>
textexttext
<p>stringstring</p>
<p><a href="tel:xxxxxxxx" target="_blank">xxxxxxxxxx</a></p>
<div>textextext</div>
<p>string,string<br/>string</p>"""
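# Use + rather than * so the pattern never matches an empty string,
# which would insert an empty <p></p> before every line that starts with a tag.
r = re.compile(r"(?<=\n)([^\n<>\\]+)")
print(r.sub(r'<p>\g<1></p>', text))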
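A lookbehind on \n cannot match bare text on the very first line of the input. As a rough sketch of one way around that, the variant below anchors with ^ under re.MULTILINE and requires at least one non-whitespace character, so blank lines are not wrapped. The names BARE_LINE_START and wrap_line_start_text are made up for this example, and the same caveat applies: it only handles bare text that begins a line, and a regex is not a substitute for a real HTML parser.

```python
import re

# Optional indentation, then bare text that starts with a non-space,
# non-angle-bracket character and runs up to the next tag or line end.
BARE_LINE_START = re.compile(r"^[ \t]*([^\s<>][^\n<>]*)", re.MULTILINE)


def wrap_line_start_text(text):
    """Wrap bare text found at the start of any line in <p>...</p>."""
    return BARE_LINE_START.sub(r"<p>\g<1></p>", text)


if __name__ == "__main__":
    sample = "plain first\n<p>x</p>\nbare<br/>\n<div>d</div>"
    print(wrap_line_start_text(sample))
```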