Question

I have a bunch of HTML pages in which I want to convert CSS-formatted text snippets into standard HTML tags. For example, <span class="bold">some text</span> should become <b>some text</b>.

I am stuck on nested span snippets:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

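I would like to convert the snippets using Python's regular expression library. What is the best strategy for a regex search-and-replace on the input above?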

Answer 1

My solution uses lxml and cssselect plus a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

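output = []
doc = fromstring(html)
spans = doc.cssselect("span")
for span in spans:
    # Skip spans that have no class attribute.
    if span.attrib.get("class"):
        # span.text holds only the text before the first child element,
        # so the nesting of the original spans is lost here.
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print("".join(output))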

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

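NB: This is a naive solution and does not produce the correct output: the nesting is flattened. To get it right you would have to keep track of the open tags (in effect a stack) and close them in the right order at the end.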
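A minimal sketch of the "keep track of open tags" idea from the note above, using only the standard library's html.parser instead of lxml or regular expressions. The names SpanConverter, CLASS_TO_TAG and convert are made up for this illustration, and the mapping is assumed to be the same three classes as above. It is meant for small fragments like the ones in the question: comments and doctype declarations are dropped, and text is re-escaped, so treat it as a starting point rather than a complete converter.

```python
from html import escape
from html.parser import HTMLParser

# Assumed mapping, same three classes as in the answer above.
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}


class SpanConverter(HTMLParser):
    """Rewrite <span class="..."> as <b>/<i>/<u> while preserving nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        # One entry per open <span>: its replacement tag, or None if the
        # span is left unchanged.
        self.open_spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            new_tag = CLASS_TO_TAG.get(dict(attrs).get("class"))
            self.open_spans.append(new_tag)
            if new_tag is not None:
                self.out.append("<%s>" % new_tag)
                return
        # Any other tag (or an unmapped span) is copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            new_tag = self.open_spans.pop()
            if new_tag is not None:
                self.out.append("</%s>" % new_tag)
                return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape again.
        self.out.append(escape(data, quote=False))


def convert(html):
    parser = SpanConverter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(convert('<span class="italic"><span class="bold">XXXXXXXX</span></span>'))
# prints: <i><b>XXXXXXXX</b></i>
```

Because each closing </span> pops the entry pushed by its matching opening tag, the replacement tags are closed in the correct order no matter how deep the nesting goes. Spans with an unknown class, or with several classes at once, are left as they are.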
