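I need to filter a rather long (but very regular) set of .html files, modifying a few constructs only where they appear in text elements.
A good example is changing

    <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p>

to

    <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>

I can easily parse the files with html.parser, but it is unclear how to generate the result file, which should be as close as possible to the input (no reformatting).
I had a look at Beautiful Soup, but it seems far too big for this (supposedly?) simple task.
Note: I do not need or want to serve the .html files to any kind of browser; I only need them updated (possibly in place) with slightly changed content.
Update:
Following @soundstripe's advice, I wrote the following code: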

    import bs4
    from re import sub

    def handle_html(html):
        sp = bs4.BeautifulSoup(html, features='html.parser')
        # Copy the generator into a list first: the tree is modified while iterating.
        for e in list(sp.strings):
            # Turn each "quoted" run inside a text node into &ldquo;quoted&rdquo;.
            s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
            if s != e:
                e.replace_with(s)
        return str(sp).encode()

    raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
    new = handle_html(raw)
    print(raw)
    print(new)

Unfortunately, BeautifulSoup tries to be too smart for its own (and my) good:

    b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
    b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

That is, it turns the plain & into &amp;, which breaks the &ldquo; entity (note that I am working with bytes, not strings; is that relevant?).
How can I fix this?
Answer 0 (score: 1)
I don't know why you wouldn't use BeautifulSoup. Here is an example that replaces the quotes the way you asked.

    import re
    import bs4

    raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p>"""
    soup = bs4.BeautifulSoup(raw, features='html.parser')

    def replace_quotes(s):
        # Operate on the argument s (the original referenced the loop variable e by mistake).
        return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)

    for e in list(soup.strings):
        # wrapping the new string in a BeautifulSoup() call so the entities are parsed correctly
        new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
        e.replace_with(new_string)

    # use the formatter keyword of soup.encode() to ask for HTML entities in the output
    new = soup.encode(formatter='html')
    print(raw)
    print(new)

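
If the output has to stay as close as possible to the input, note that BeautifulSoup re-serializes the whole tree (it closed the two unclosed <div> elements in the output shown in the question). As a possible alternative, here is a rough sketch that uses only the standard library's html.parser and echoes the markup back, touching text nodes only. The names QuoteFilter, filter_quotes and RAW_TEXT_TAGS are made up for this illustration, and skipping script/style content is my own assumption about what is wanted.

```python
import re
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # assumption: their contents are code, not prose


class QuoteFilter(HTMLParser):
    """Echo the input markup back, rewriting only text nodes (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back under the same name.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        self.in_raw_text = tag in RAW_TEXT_TAGS
        self.out.append(self.get_starttag_text())  # original tag text, untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.in_raw_text = False
        self.out.append(f"</{tag}>")  # the parser hands the name back lowercased

    def handle_data(self, data):
        if not self.in_raw_text:
            data = re.sub(r'"([^"]+)"', r"&ldquo;\1&rdquo;", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def filter_quotes(html):
    parser = QuoteFilter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


raw = '<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
print(filter_quotes(raw))
```

Known limits of this sketch: text is delivered in chunks that end at every tag and entity, so a quoted phrase containing an inline tag or an entity is not matched; end tags are re-emitted in normalized lowercase form; and an entity written without its trailing semicolon gains one. For very regular input files that may well be acceptable. The function works on str, so files read as bytes would need decoding first.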