Simple .html filter in Python: modify only text elements

Time: 2019-05-07 13:26:38

Tags: python html filter

I need to filter a fairly long (but very regular) set of .html files, modifying a few constructs only where they occur in text elements.

A good example is changing <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>

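I can easily parse the files with html.parser, but it is unclear how to generate the result file, which should be as similar as possible to the input (no reformatting).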

I have looked at Beautiful Soup, but it seems far too big for this (supposedly?) simple task.

Note: I do not need or want to serve the .html files to any kind of browser; I just need to update them (possibly in place) with slightly changed content.

Update:

Following @soundstripe's suggestion, I wrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately, BeautifulSoup tries to be too smart for its own (and my) good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

That is, it converts the plain & into &amp;, which breaks the &ldquo; entity (note that I am using a bytes object, not a string. Is that relevant?).
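The same escaping can be reproduced without BeautifulSoup. The short sketch below uses only the standard html module and assumes that the serializer treats the replacement string as literal text, which is what the output above suggests; the variable names are illustrative only.

```python
import html

# The replacement string contains entity syntax, but as literal text.
replacement = '&ldquo;good&rdquo;'

# Escaping literal text for HTML output turns every bare & into &amp;,
# which is how &ldquo; ends up as &amp;ldquo;.
escaped = html.escape(replacement, quote=False)
print(escaped)

# Decoding the entities first yields the actual quote characters, which a
# serializer could then write back out as entities.
decoded = html.unescape(replacement)
print(decoded)
```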

How can I fix this?

1 Answer:

Answer 0 (score: 1)

I don't know why you wouldn't use BeautifulSoup. Here is an example that replaces the quotes the way you asked.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

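def replace_quotes(s):
    # operate on the argument s, not on the loop variable e
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities;
    # naming the parser explicitly avoids a warning and extra <html>/<body> wrappers
    new_string = bs4.BeautifulSoup(replace_quotes(e), features='html.parser')
    e.replace_with(new_string)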

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)
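
If pulling in BeautifulSoup feels too heavy, or if the output has to stay as close to the input as possible, a rough alternative is to stay with the html.parser module mentioned in the question and echo everything back, rewriting only text. The sketch below is an illustration under assumptions, not a tested solution for your files: the names TextOnlyFilter and filter_html are made up here, end tags are re-emitted in lowercase without inner whitespace, an entity written without its trailing semicolon gains one, and CDATA sections are not handled.

```python
import re
from html.parser import HTMLParser

QUOTED = re.compile(r'"([^"]+)"')


class TextOnlyFilter(HTMLParser):
    """Echo the input, rewriting straight double quotes in text runs only."""

    def __init__(self):
        # convert_charrefs=False reports entities separately, so they can be
        # written back exactly as they appeared instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.out = []      # pieces of the output document
        self.text = []     # pending run of text and entities
        self.raw = False   # True while inside <script> or <style>

    def flush_text(self):
        # Apply the substitution to a whole text run, entities included.
        if self.text:
            self.out.append(QUOTED.sub(r'&ldquo;\1&rdquo;', ''.join(self.text)))
            self.text = []

    def emit(self, markup):
        # Markup ends the current text run and is copied through untouched.
        self.flush_text()
        self.out.append(markup)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())
        self.raw = tag in ('script', 'style')

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.emit(f'</{tag}>')
        self.raw = False

    def handle_data(self, data):
        if self.raw:
            self.emit(data)          # leave script and style bodies alone
        else:
            self.text.append(data)

    def handle_entityref(self, name):
        self.text.append(f'&{name};')

    def handle_charref(self, name):
        self.text.append(f'&#{name};')

    def handle_comment(self, data):
        self.emit(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.emit(f'<!{decl}>')

    def handle_pi(self, data):
        self.emit(f'<?{data}>')


def filter_html(html_text):
    parser = TextOnlyFilter()
    parser.feed(html_text)
    parser.close()
    parser.flush_text()
    return ''.join(parser.out)


raw = '''<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>'''
print(filter_html(raw))
```

Because start tags are copied with get_starttag_text(), attribute values such as class="speech" are never touched, and the unclosed <div> in the sample stays unclosed instead of being repaired. The function works on str, so a bytes input would need to be decoded first and encoded again afterwards.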