Question

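I am trying to use re.sub() to change all HTML tag delimiters < and > to { and }. Here's the catch: I only want to change the matches that fall between <table and </table>.

For the life of me, I cannot find a regex tutorial or post that shows how to change every match of one regex, but only between the matches of two other regexes. I have looked at positive/negative lookahead and lookbehind tutorials and so on, with no luck. I spent several hours on this before deciding to post.

Here is the best I have managed so far: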

(?<=<table)(?:.*?)(<)(?:.*)(?=<\/table>)

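This matches a single "<" between the opening and closing table tags, but I don't know how to match more than one. I have tried making the any-character groups lazy or not lazy and so on, with no luck.

The point of all this is that I have a string containing a lot of HTML, and I want to keep all the HTML tags that are inside tables, as well as the table tags themselves.

My current plan is to change all tags inside tables (and the table tags themselves) to use { and }, then remove every HTML tag delimited by < and > from the whole document, then change all { and } back to < and >. This should preserve the tables (and any other tags inside them).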
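Steps two and three of this plan can be sketched as follows, assuming step one has already produced text in which the table markup uses braces. The function name strip_and_restore is hypothetical, and the sketch assumes the text contains no literal { or } of its own, since those would also be turned into < and > in the last step.

```python
import re

def strip_and_restore(curlified):
    """Drop the remaining <...> tags, then turn {...} back into <...>."""
    # Step 2: remove every tag that is still written with angle brackets.
    without_tags = re.sub(r'<[^>]*>', '', curlified)
    # Step 3: map the braces back to angle brackets.
    return without_tags.translate(str.maketrans('{}', '<>'))

print(strip_and_restore('<p>a</p>{table}{tr}{td}x{/td}{/tr}{/table}<b>c</b>'))
# a<table><tr><td>x</td></tr></table>c
```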

Example input:

<font style = "font-family:inherit>
<any other HTML tags>

random text

<table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;">
<tr>
<td colspan="3">
<font style="font-family:inherit;font-size:12pt;font- 
weight:bold;">washington, d.c. 20549</font>
random text
<any other HTML tags within table tags>
</td>
</table>

random text

<font style = "font-family:inherit>

Example output:

<font style = "font-family:inherit>
<any other HTML tags>

random text

{table cellpadding="0" cellspacing="0" style="font-family:times new 
roman;font-size:10pt;width:100%;border-collapse:collapse;text-align:left;"}
{tr}
{td colspan="3"}
{font style="font-family:inherit;font-size:12pt;font- 
weight:bold;"}washington, d.c. 20549{/font}
random text
{any other HTML tags within table tags}
{/td}
{/table}

random text

<font style = "font-family:inherit>

Thank you, Deli

Answer 1

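Don't be too hard on yourself. I'm not sure this can be done in one shot with a standard re.sub. In fact, I think it is either impossible or very complicated, for example by passing a custom function as the replacement (you can stuff a lot into a custom function, up to an entire HTML parser).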
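For reference, the custom-function idea can be sketched like this. It is only a minimal sketch, assuming tables are not nested and every <table has a matching </table>; the names curlify_tables and to_braces are made up for illustration.

```python
import re

def curlify_tables(html):
    """Replace < and > with { and }, but only inside <table ...>...</table>."""
    def to_braces(match):
        # match.group(0) is one whole table, from <table to </table>.
        return match.group(0).replace('<', '{').replace('>', '}')

    # (?s) lets . match newlines; .*? stops at the first </table>.
    return re.sub(r'(?s)<table\b.*?</table>', to_braces, html)

print(curlify_tables('<p>a</p><table id="t"><tr><td>x</td></tr></table><b>c</b>'))
# <p>a</p>{table id="t"}{tr}{td}x{/td}{/tr}{/table}<b>c</b>
```

The outer pattern finds each table and the function handles the inner replacements, so no single pattern has to express "between two other matches".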

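Instead, I strongly recommend a simple solution that splits the string and reassembles it with split/join. Alternatively, you may decide to perform a sequence of replacements.

Assuming a single table, l = s.split('table>'); l = l[1] gives you the table content, and l.split(. Below is a version for multiple tables: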

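def curlyfy_el(s, tag='table'):
    # Text before the first <table must stay untouched, so split it off first.
    head, *fragments = s.split('<%s' % tag)
    return ('{%s' % tag).join(
        [head] + [
            ('{/%s}' % tag).join(
                # Part 0 is the table body (curlify it); part 1, if any, is the text after </table>.
                y if i != 0 else y.replace("<", "{").replace(">", "}")
                for i, y in enumerate(x.split('</%s>' % tag, 1)))
            for x in fragments])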

A more readable version:

def curlyfy_el(s, tag='table'):
    h, *t = s.split('<%s' % tag)  # split into the text before the first table and fragments that each start a table
    r = [h]
    for x in t:
        head, *tail = x.split('</%s>' % tag, 1)  # table body and the rest; maxsplit=1 keeps any duplicate closing tag in the rest
        head = head.replace("<", "{")
        head = head.replace(">", "}")
        r.append(('{/%s}' % tag).join([head, *tail]))  # put the closing tag back, in curly form
    return ('{%s' % tag).join(r)  # put the opening tags back, in curly form

In general, HTML is best handled with a parsing library (such as Beautiful Soup); ad hoc code will fail in many special cases.
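As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser rather than Beautiful Soup (a swap made only to avoid a third-party dependency). It keeps text everywhere but keeps tags only inside tables, which is the end result the question is after. The names TableTagKeeper and keep_table_tags are hypothetical, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser

class TableTagKeeper(HTMLParser):
    """Keep text everywhere, but keep tags only inside <table>...</table>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # number of <table> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
        if self.depth:
            self.out.append(self.get_starttag_text())  # raw tag text as written

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no separate end tag.
        if self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.out.append('</%s>' % tag)
            if tag == 'table':
                self.depth -= 1

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)  # keep entities such as &amp; as written

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

def keep_table_tags(html):
    parser = TableTagKeeper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(keep_table_tags('<p>intro</p><table><tr><td>cell</td></tr></table><p>outro</p>'))
```

Because the parser tracks table depth, nested tables are handled too, which the string-splitting versions do not attempt.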

Answer 2

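As Serge mentioned, this is not really a problem you want to solve with a single regular expression, but rather with several regular expressions and a bit of Python magic: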

import re

your_text = '<p>intro</p><table><tr><td>cell</td></tr></table><p>outro</p>'  # sample input; use your own HTML string

def replacer(match):  # re.sub can take a function as the repl argument, which gives you more flexibility
    choices = {'<': '{', '>': '}'}  # replace < with { and > with }
    return choices[match.group(0)]

result = []  # store the results here
# Split the text into table parts and non-table parts. Note that .* is greedy, so with
# several tables the table part runs from the first <table to the last table>.
for text in re.split(r'(?s)(?=<table)(.*)(?<=table>)', your_text):
    if text.startswith('<table'):  # if this is a table part, do the <> replacement
        result.append(re.sub(r'[<>]', replacer, text))
    else:  # otherwise leave it unchanged
        result.append(text)
print(''.join(result))  # join the list of strings to get the final result

See the re.sub documentation on passing a function as the repl argument.

And an explanation of the regular expression:

(?s)        # the . matches newlines 
(?=<table)  # positive look-ahead matching '<table'
(.*)        # matches everything between <table and table> (it is inclusive because of the look-ahead/behinds)   
(?<=table>) # positive look-behind matching 'table>'

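Also note that because (.*) is in a capturing group, the text it matches is included in the list of strings returned by re.split (see the re.split documentation).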
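A tiny example of that re.split behaviour, using an unrelated pattern chosen only to keep the output short:

```python
import re

# Without a capturing group, the separators are dropped.
print(re.split(r'\d+', 'a1b22c'))    # ['a', 'b', 'c']

# With a capturing group, the separators are kept in the result list.
print(re.split(r'(\d+)', 'a1b22c'))  # ['a', '1', 'b', '22', 'c']
```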

Answer 3

You can match with the following regular expression and then replace with Group 1:

[\s\S]*(<table[\s\S]*?</table>)[\s\S]*

This matches anything before '<table', then captures the table and its contents as Group 1, then matches everything after it.

Replace with:

$1

That gives you only the table with its contents.
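In Python's re.sub, the group reference is written \1 (or \g<1>) rather than $1. A minimal sketch, with only_table as a made-up helper name:

```python
import re

pattern = r'[\s\S]*(<table[\s\S]*?</table>)[\s\S]*'

def only_table(html):
    # r'\1' is Python's spelling of the $1 group reference.
    return re.sub(pattern, r'\1', html)

print(only_table('a<table>x</table>b'))  # <table>x</table>

# The leading [\s\S]* is greedy, so with several tables only the last one is kept.
print(only_table('a<table>1</table>b<table>2</table>c'))  # <table>2</table>
```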

Python: find multiple regex matches within one regex match
