Question

I am trying to remove some text from a given string. The problem is as follows: I have a string, say, HTML code like this.

<!DOCTYPE html>
<html>
  <head>
    <style>
    body {background-color: powderblue;}
    h1   {color: blue;}
    p    {color: red;}
    </style>
  </head>

  <body>

  <h1>This is a heading</h1>
  <p>This is a paragraph.</p>

  </body>
</html>

I want the code to remove all CSS-related code, i.e. the string should now look like this:

<!DOCTYPE html>
<html>
  <head>

  </head>
  <body>

  <h1>This is a heading</h1>
  <p>This is a paragraph.</p>

  </body>
</html>

I have tried this using the following function in Python:

def css_remover(text):
    m = re.findall('<style>(.*)</style>$', text,re.DOTALL)
    if m:
        for eachText in text.split(" "):
            for eachM in m:
                if eachM in  eachText:
                    text=text.replace(eachText,"")
                    print(text)

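But this does not work. I want the function to handle spaces and newlines so that everything between the <style> and </style> tags is removed. Also, any words attached to the tags should not be affected. For example, hello<style> klasjdklasd </style>> should produce hello>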

Answer 1

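You put a $ at the end of the pattern, so it only matches when </style> sits at the very end of the string. Try this:

x = re.sub('<style>.*?</style>', '', text, flags=re.DOTALL)
print(x)

You can check out this website, which has a nice regex demo.

A caveat: I'm not very familiar with CSS, so nested <style> tags might be a problem.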
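As a rough sketch, the non-greedy pattern could be wrapped back into the asker's css_remover function like this. The [^>]* part (to tolerate attributes such as type="text/css") and the re.IGNORECASE flag are assumptions added for illustration, not part of the original answer, and a regex like this is still no substitute for a real HTML parser on messy input.

```python
import re

# Non-greedy match of a whole <style> element.
# Assumption beyond the original answer: [^>]* tolerates attributes on the
# opening tag, and IGNORECASE accepts <STYLE> as well.
STYLE_RE = re.compile(r'<style\b[^>]*>.*?</style>', re.DOTALL | re.IGNORECASE)


def css_remover(text):
    """Return text with every <style>...</style> block removed."""
    return STYLE_RE.sub('', text)


# Text attached to the tags is left untouched.
print(css_remover('hello<style> klasjdklasd </style>>'))  # prints: hello>
```

Returning the cleaned string, rather than printing inside the function as the original attempt did, lets the caller decide what to do with the result.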

Answer 2

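Pay special attention to the ? character in the <style>(.*?)</style> part of the regular expression, which keeps the match from being "too greedy". Otherwise, in the example below, it would also remove the <title> HTML tag.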

import re

text = """
<!DOCTYPE html>
<html>
  <head>
    <style>
    body {background-color: powderblue;}
    h1   {color: blue;}
    p    {color: red;}
    </style>
    <title>Test</title>
    <style>
    body {background-color: powderblue;}
    h1   {color: blue;}
    p    {color: red;}
    </style>
  </head>

  <body>

  <h1>This is a heading</h1>
  <p>This is a paragraph.</p>

  </body>
</html>
"""

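# Non-greedy match; also consumes the indentation before <style>
# and the newline after </style> so no blank lines are left behind.
regex = re.compile(r' *<style>(.*?)</style> *\n?', re.DOTALL)
text = regex.sub('', text)

print(text == """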
<!DOCTYPE html>
<html>
  <head>
    <title>Test</title>
  </head>

  <body>

  <h1>This is a heading</h1>
  <p>This is a paragraph.</p>

  </body>
</html>
""")

How do I remove a specific string between two substrings from a given string in Python?
