Question

I have a lot of HTML text, for example:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

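Sometimes HTML tags such as <sub> and </sub> are missing one of their angle brackets (< or >). This causes trouble later in my code. My question is: how can I intelligently detect the missing brackets and repair them?

The correct text would be: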

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .'

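Of course, I could hard-code every possible bracket configuration, but that would take far too long, because my text contains many more errors of this kind.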

text = re.sub( r'</sub ', r'</sub>', text) 
text = re.sub( r' /sub>', r'</sub>', text)

...and code like the above might add a second bracket to tags that were already correct.

Answer 1

Try:

text = 'Hello, how <sub> are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

text_list = text.split()
for i, word in enumerate(text_list):
    if 'sub' in word:
        if '<' != word[0]:
            word = '<' + word
        if '>' != word[-1]:
            word += '>'
        text_list[i] = word

result = ' '.join(text_list)
print(result)

The output will be:

Hello, how <sub> are </sub> you ? There is a <sub> small error </sub> in this text here and another one <sub> here </sub> .
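
A caveat with this approach: 'sub' in word is true for any word that merely contains those letters (for example "subject"), and split()/join() collapses runs of whitespace. Below is a minimal sketch of a stricter variant, assuming tags are always whitespace-separated tokens as in the question. TAG_TOKEN and fix_tokens are names made up for this illustration.

```python
import re

# A token counts as a tag only if it is exactly "sub" or "/sub"
# with optional angle brackets around it (hypothetical helper pattern).
TAG_TOKEN = re.compile(r"<?(/?)sub>?")

def fix_tokens(text):
    fixed = []
    for word in text.split():
        match = TAG_TOKEN.fullmatch(word)
        if match:
            # Rebuild the tag with both brackets, keeping the "/" if present.
            word = "<" + match.group(1) + "sub>"
        fixed.append(word)
    # Note: joining with single spaces collapses repeated whitespace.
    return " ".join(fixed)

print(fix_tokens("a subject <sub> x </sub y <sub> z /sub> ."))
```

A bare word sub with no bracket at all is still treated as a tag here, since nothing distinguishes it from a tag that lost both brackets.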

Answer 2

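I would search for an expression like sub.*?/sub. It assumes nothing about the angle brackets, but it only matches a sub that is paired with a /sub, which lowers the chance of a false match. The non-greedy quantifier *? keeps the match from stretching from the first sub to the last /sub.

Combine this with the fact that re.sub supports capture groups: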

text = re.sub('<?sub>?(.*?)<?/sub>?', '<sub>\\1</sub>', text)
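
The same idea can be wrapped in a small helper so the tag name is not hard-coded. This is only a sketch: fix_paired_tag is a hypothetical name, the \b word boundaries and re.DOTALL are additions (to leave words such as "subject" alone and to let the content span lines), and it assumes the tags occur as properly ordered open/close pairs.

```python
import re

def fix_paired_tag(text, tag="sub"):
    name = re.escape(tag)
    # Optional brackets around the opening and closing tag names;
    # the lazy group captures whatever sits between the pair.
    pattern = re.compile(
        r"<?\b" + name + r"\b>?(.*?)<?/" + name + r"\b>?",
        re.DOTALL,
    )
    # A function replacement avoids backslash escaping in the template.
    return pattern.sub(
        lambda m: "<{0}>{1}</{0}>".format(tag, m.group(1)), text
    )

broken = ("Hello, how <sub> are </sub> you ? There is a <sub> small error </sub"
          "  in this text here and another one <sub> here /sub> .")
print(fix_paired_tag(broken))
```

A standalone word sub in ordinary prose would still be taken for an opening tag, so this works best when the tag name does not also occur as plain text.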

Answer 3

Using regular expressions:

import re
text = 'Hello, how <sub are </sub> you ? There is a <sub> small error </sub  in this text here and another one <sub> here /sub> .'

text = re.sub(r'<?[^/]sub>?', '<sub>', text)
text = re.sub(r'<?/sub>?', '</sub>', text)

print(text)

Output:

Hello, how <sub> are </sub> you ? There is a <sub> small error </sub>  in this text here and another one <sub> here </sub> .

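Edit: how it works.

re.sub(pattern, replacement, string) searches string for pattern and replaces every match with replacement.

The pattern '<?[^/]sub>?' breaks down as follows:

'<?' matches an optional '<'; the '?' means the preceding character may or may not be present.

'[^/]' matches exactly one character that is not '/'. For an opening tag such as '<sub' that character is the '<' itself; for '/sub' there is no match, so closing tags are skipped.

'sub' matches the literal text 'sub'.

'>?' matches an optional '>'.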
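
One thing to be aware of: because [^/] consumes a character, an opening tag that lost its '<' (as in ' sub>') swallows the space before it, and a tag at the very start of the string is not matched at all. A sketch of a variant that uses a negative lookbehind, which checks the previous character without consuming it, is shown below. OPEN_TAG, CLOSE_TAG and fix_sub_tags are illustrative names, and the \b boundaries are an addition so that words like "subject" are left alone.

```python
import re

# Opening tag: "sub" not preceded by "/", with optional brackets.
OPEN_TAG = re.compile(r"<?(?<!/)\bsub\b>?")
# Closing tag: "/sub" with optional brackets.
CLOSE_TAG = re.compile(r"<?/sub\b>?")

def fix_sub_tags(text):
    text = OPEN_TAG.sub("<sub>", text)
    return CLOSE_TAG.sub("</sub>", text)

print(fix_sub_tags("sub> x /sub> and a subject <sub y </sub"))
```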

Answer 4

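Good question! Here is a solution that does not hard-code the word sub and works on arbitrary tags, as long as only one bracket is missing per tag and the HTML tags carry no attributes (otherwise, how would we know where the tag should be closed? We could assume an attr="" format, but that is a big assumption). Also, the tags do not have to be separated by spaces as they are in your example, which is not typical in HTML anyway.

Code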

import re

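def repair(text, backwards=False):
    """Insert the missing closing bracket after each tag name.

    With backwards=True the text is expected reversed and the bracket roles swap.
    """
    left_bracket, right_bracket = "<", ">"

    if backwards:
        left_bracket, right_bracket = ">", "<"

    i = 0

    while i < len(text):
        if text[i] == left_bracket:
            j = i + 1

            # Skip over the tag name (word characters and "/").
            while j < len(text) and re.match(r"[/\w]", text[j]):
                j += 1

                # In a reversed string the "/" is the last character of a closing tag.
                if backwards and text[j-1] == "/":
                    break

            # Insert the companion bracket if the tag is not already closed.
            if j >= len(text) or text[j] != right_bracket:
                text = text[:j] + right_bracket + text[j:]

            i = j

        i += 1

    return text

def repair_tags(html):
    # The reversed pass adds missing "<"; the forward pass then adds missing ">".
    return repair(repair(html[::-1], True)[::-1])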

Test

if __name__ == "__main__":
    original = '''<li>
    <a>
        About Us
        <span>
            Learn more about Stack Overflow the company
        </span>
    </a>
</li>
<li>
    <a>
        Business
        <span>
            Learn more about hiring developers or posting ads with us
        </span>
    </a>
</li>'''
    corrupted = '''li>
    <a
        About Us
        span>
            Learn more about Stack Overflow the company
        </span
    </a
/li>
<li
    <a
        Business
        span>
            Learn more about hiring developers or posting ads with us
        /span>
    </a
</li'''

    print(repair_tags(corrupted))
    print("repaired matches original?", repair_tags(corrupted) == original)

Output

<li>
    <a>
        About Us
        <span>
            Learn more about Stack Overflow the company
        </span>
    </a>
</li>
<li>
    <a>
        Business
        <span>
            Learn more about hiring developers or posting ads with us
        </span>
    </a>
</li>
repaired matches original? True

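How it works

Step through the string looking for bracket characters. When one is found, move forward until you hit the end of the string or a non-word character. If the search reached the end of the string, or the current non-word character is not the matching companion bracket, insert the companion bracket.

Then do the same on the reversed string, swapping the target brackets, with a small tweak to break on / when locating the position of a closing tag.

The time complexity is not great because of the string building: every insertion copies the whole string. There is undoubtedly a simple regex for this, so treat this as a proof of concept.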
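
For what it's worth, here is a rough sketch of what such a regex version might look like, under the same assumptions (attribute-free tags, at most one bracket missing per tag). It is not a drop-in equivalent of repair_tags above; MISSING_LEFT, MISSING_RIGHT and repair_tags_regex are names invented for this example.

```python
import re

# A tag name followed by ">" that is not already preceded by "<"
# (the lookbehind also prevents matching in the middle of a name).
MISSING_LEFT = re.compile(r"(?<![<\w/])(/?\w+>)")
# "<" plus a tag name that is not followed by ">".
MISSING_RIGHT = re.compile(r"(</?\w+)(?![\w>])")

def repair_tags_regex(html):
    html = MISSING_LEFT.sub(r"<\1", html)   # add missing "<"
    return MISSING_RIGHT.sub(r"\1>", html)  # add missing ">"

print(repair_tags_regex("li>\n    <a\n        About Us\n    /a>\n</li"))
```

Each re.sub call builds its result in a single pass, which avoids the repeated slicing and concatenation of the loop version.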

