Question

I need to filter a rather long (but very regular) set of .html files, modifying a few constructs only when they appear in text elements.

A good example is changing <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.

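I can easily parse the files with html.parser, but it is unclear how to generate the result file, which should be as similar as possible to the input (no reformatting).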

I have looked at BeautifulSoup, but it seems really too big for this (supposedly?) simple task.

Note: I do not need or want to serve the .html files to any kind of browser; I just need to update them (possibly in place) with slightly changed content.

Update:

Following @soundstripe's suggestion, I wrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately, BeautifulSoup tries to be too smart for its own (and my) good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

That is, it converts the plain & into &amp;, thus breaking the &ldquo; entity. (Note that I am using byte arrays, not strings. Is that relevant?)

How can I fix this?

Answer 1

I don't really know why you wouldn't use BeautifulSoup. Here is an example that replaces the quotes as you asked.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    # operate on the argument s, not on the loop variable e from the enclosing scope
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
    e.replace_with(new_string)

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)
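
For comparison, here is a minimal sketch that uses only the standard library's html.parser, which the question mentions, and leaves everything outside text nodes byte-for-byte untouched. It is an illustration under stated assumptions, not a tested drop-in: the names TextSpanCollector and curl_quotes_in_text are made up for this example, the whole document is assumed to be fed in a single feed() call, and the content of <script> and <style> elements is skipped. The idea is to ask the parser only where each text chunk sits in the source (via getpos()), then apply the substitution to those slices of the original string.

```python
import re
from html.parser import HTMLParser

QUOTED = re.compile(r'"([^"]+)"')


class TextSpanCollector(HTMLParser):
    """Hypothetical helper: records where text nodes sit in the source."""

    def __init__(self, source):
        # convert_charrefs=False keeps handle_data() chunks identical to the source
        super().__init__(convert_charrefs=False)
        self.source = source
        # absolute offset at which each line starts, to translate getpos()
        self.line_starts = [0] + [m.end() for m in re.finditer("\n", source)]
        self.spans = []
        self.in_raw_text = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_raw_text = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_raw_text = False

    def handle_data(self, data):
        if self.in_raw_text:
            return
        line, col = self.getpos()  # position of the chunk being reported
        start = self.line_starts[line - 1] + col
        end = start + len(data)
        # only record the span if it really matches the source text
        if self.source[start:end] == data:
            self.spans.append((start, end))


def curl_quotes_in_text(source):
    collector = TextSpanCollector(source)
    collector.feed(source)
    collector.close()
    out, last = [], 0
    for start, end in collector.spans:
        out.append(source[last:start])  # markup between text nodes, copied verbatim
        out.append(QUOTED.sub(r"&ldquo;\1&rdquo;", source[start:end]))
        last = end
    out.append(source[last:])
    return "".join(out)
```

With bytes input, as in the question, one would decode first and encode afterwards, for example curl_quotes_in_text(raw.decode("utf-8")).encode("utf-8"), assuming the files are UTF-8. Note that a quoted phrase interrupted by an inline tag or an entity arrives as several chunks, so this sketch will not match it.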

Simple .html filter in Python - modify text elements only
