Question

I'm trying to strip HTML tags. It works to some extent, but not all tags are removed. The markup mentioned below does not go away.

import lxml.html

def remove_tags(content):
    # Parse the HTML, dropping comments and whitespace-only text nodes
    parser = lxml.html.HTMLParser(remove_comments=True,
                                  remove_blank_text=True)
    document = lxml.html.document_fromstring(content, parser=parser)
    return document.text_content()

# not_dealt_with_list holds the HTML strings that still contain markup
print('NOT DEALT WITH:')
for body in not_dealt_with_list:
    #p = re.compile(r'<.*?[\\t\\n\\r\\s]*?.*?>')
    print(remove_tags(body))
    #print(p.sub('', body))
    #body = re.sub()

Answer 1

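What you're trying to remove appears to be embedded in HTML comments (it doesn't look like HTML there). HTML comments start with <!-- and end with -->, and that is what you want to search for.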

Try this regular expression to match everything inside a comment, including content that spans multiple lines, and replace it:

<!--(.|\n)*?-->

Let me know how it works!
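
The comment-matching regex from this answer can be applied in Python roughly as follows. This is a minimal sketch, not the answerer's own code: the names `COMMENT_RE`, `strip_comments`, and `sample` are hypothetical, and `re.DOTALL` is used as an assumed equivalent of the `(.|\n)` alternation so that `.` also matches newlines.

```python
import re

# Non-greedy match of one HTML comment. re.DOTALL lets "." match newlines,
# standing in for the (.|\n)*? alternation shown in the answer.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_comments(html):
    """Return html with every <!-- ... --> block removed."""
    return COMMENT_RE.sub('', html)

# Hypothetical sample input with a comment that spans two lines
sample = '<p>keep</p><!-- drop\nthis --><p>also keep</p>'
print(strip_comments(sample))  # → <p>keep</p><p>also keep</p>
```

Running this before `remove_tags` would drop comment blocks up front. Regex-based stripping is only a rough tool and can misfire on unusual input, such as comment-like text inside scripts.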
