I want to convert this:
<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>
into this:
Charming boutique selling trendy casual dressy apparel for women, some plus sized items, swimwear, shoes jewelry.
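I'm confused about how to remove the special characters along with the letters that sit between them (the tags and the &amp; entities). Can anyone suggest an approach?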
Answer 0 (score: 2)
Try the following:
import re

string = '<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>'
# Strip simple opening and closing tags such as <b> or </u>.
string = re.sub(r'</?[a-z]+>', '', string)
# Turn the escaped ampersand back into a literal one.
string = string.replace('&amp;', '&')
print(string)  # prints: Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.
The string you want to change looks like HTML that has been escaped a few times, so my solution only covers that case.
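I used a regular expression to replace the tags with an empty string, and then replaced the escaped entity &amp; with a literal & character.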
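Note that the target string in the question drops the ampersands altogether rather than keeping a literal &. If that is what is wanted, a minimal sketch could look like the following. The function name strip_tags_and_ampersands is only illustrative, and it assumes the input contains nothing more complex than simple tags without attributes:

```python
import re
from html import unescape


def strip_tags_and_ampersands(text):
    # Remove simple tags such as <b> or </u> (no attributes expected).
    text = re.sub(r'</?[a-z]+>', '', text, flags=re.IGNORECASE)
    # Decode entities such as &amp; into their literal characters.
    text = unescape(text)
    # Drop the ampersands, then collapse the double spaces they leave behind.
    text = text.replace('&', '')
    return re.sub(r'\s+', ' ', text).strip()


raw = '<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>'
cleaned = strip_tags_and_ampersands(raw)
print(cleaned)
```

Collapsing whitespace as the last step matters here, because deleting "&" from "casual & dressy" would otherwise leave two spaces in a row.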
I hope this is what you were after; let me know if you run into any trouble.
Answer 1 (score: 2)
You can use the html module together with BeautifulSoup to get the text with the entities unescaped and the tags removed:
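from html import unescape
from bs4 import BeautifulSoup

s = "<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>"
# unescape() decodes entities such as &amp;; BeautifulSoup then drops the tags.
# The 'lxml' parser needs the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(unescape(s), 'lxml')
print(soup.text)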
Prints:
Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.
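If installing bs4 and lxml is not an option, roughly the same result can be had from the standard library alone. This is only a sketch: the class name TextExtractor and the helper extract_text are made up for illustration, and it relies on HTMLParser's default convert_charrefs=True behaviour to decode entities such as &amp;:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


s = "<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>"
text = extract_text(s)
print(text)
```

Like the BeautifulSoup version, this keeps the ampersands as literal & characters; it trades the third-party dependency for a few more lines of code.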