Question

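I want to use a regular expression to remove an HTML element: its opening tag, its closing tag, and everything between them. How can I remove the <head> tag from the following string?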

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

so that it looks like this:

my_string = '''
<html>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

Answer 1

You can remove the <head> tag from the HTML text with Beautiful Soup's decompose() method. Try this Python code:

from bs4 import BeautifulSoup

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

soup = BeautifulSoup(my_string, 'html.parser')  # name the parser explicitly; html.parser ships with Python
soup.find('head').decompose()  # find head tag and decompose/destroy it from the html
print(soup)                    # prints html text without head tag

This prints:

<html>

<meta/>
<p>
        this is a different paragraph tag
        </p>
</html>
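
The snippet above assumes that a <head> tag is present; if soup.find('head') returns None, the call to decompose() raises AttributeError. A minimal sketch of a more defensive variant follows. The helper name remove_tags is hypothetical and not part of Beautiful Soup, and the sketch assumes the built-in html.parser backend is acceptable for your input.

```python
from bs4 import BeautifulSoup


def remove_tags(html, tag_name):
    """Return html with every <tag_name> element and its contents removed.

    remove_tags is a hypothetical helper written for illustration only.
    """
    # html.parser ships with Python, so no extra parser has to be installed.
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None once no matching tag is left, which ends the loop
    # and also covers input that never contained the tag.
    tag = soup.find(tag_name)
    while tag is not None:
        tag.decompose()
        tag = soup.find(tag_name)
    return str(soup)


cleaned = remove_tags(
    "<html><head><p>one</p></head><body><p>two</p></body></html>", "head"
)
print(cleaned)
```

Searching again after each removal keeps the loop simple and avoids touching a tag that was already destroyed along with its parent.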

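Using a regular expression for this is not recommended. However, if the tag you want to remove is not nested, you can remove it with the regex mentioned in the comments, as in the Python code below. Always avoid parsing nested structures with regular expressions and use a proper parser instead.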

import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

print(re.sub(r'(?s)<head>.*?</head>', '', my_string))

It prints the following. Note the use of (?s): your HTML spans multiple lines, so the dot has to match newline characters as well.

<html>

    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
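
The pattern above only matches a lowercase <head> with no attributes. As a rough sketch, the same idea can be made a little more tolerant with re.IGNORECASE and an attribute-skipping character class. The names HEAD_RE and strip_head are hypothetical, and the pattern still assumes that attribute values contain no '>' and that the tag is not nested. The last lines show what goes wrong when a tag of the same name is nested.

```python
import re

# Hypothetical names; the pattern extends the answer's <head>.*?</head>.
# \b keeps <header> from matching, [^>]* skips over any attributes.
HEAD_RE = re.compile(r"<head\b[^>]*>.*?</head\s*>", re.DOTALL | re.IGNORECASE)


def strip_head(html):
    """Remove every non-nested <head>...</head> block from html."""
    return HEAD_RE.sub("", html)


sample = '<html>\n<HEAD lang="en">\n<title>t</title>\n</HEAD>\n<body>text</body>\n</html>'
print(strip_head(sample))

# With nested tags of the same name, the lazy match stops at the first
# closing tag and leaves the rest of the outer element behind.
nested = "<div>outer <div>inner</div> tail</div>"
leftover = re.sub(r"(?s)<div>.*?</div>", "", nested)
print(repr(leftover))
```

The leftover text in the nested case is the reason a parser is the safer choice whenever the structure of the input is not fully under your control.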
