Question

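What I'm trying to do in this example is remove HTML tags, including everything inside them; however, I never know whether a tag will be formatted as <tag>...</tag> or just as <tag ... />. For that reason, I need the regex to behave like an 'or' statement.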

In plain English:

Replace '<tag' and everything between it and either the next '</tag>' or the next '/>'

Here is my attempt in Python:

import re

html = '''
            <title>Test1</title>
        <link rel=\'dns-prefetch\' href=\'//www.test.com\' />
        <link rel=\'dns-prefetch\' href=\'//fonts.googleapis.com\' />
        <title>Test2</title>
        <link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
        <link rel=\'dns-prefetch\' href=\'//s.w.org\' />
        <link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
        <link rel=\'dns-prefetch\' href=\'//s.w.org\' />
'''

html = re.sub(r'\\n|\\r|\\t', '', html)
html = re.sub(r'<!--(.*?)-->', '[coMmEnT]', html)

def removeTag(html, label):
    html = re.sub(r'<'+label+'(.*?)</'+label+'>|/>', '~'+label+'~', html)
    return html

html = removeTag(html, 'title')
html = removeTag(html, 'link')

print(html)

With the variables substituted in, the two removeTag() calls become:

re.sub(r'<link(.*?)</link>|/>', '~link~', html)

re.sub(r'<title(.*?)</title>|/>', '~title~', html)

Ideally, my output would be:

~title~ ~link~ ~link~ ~title~ ~link~ ~link~ ~link~ ~link~

But it is actually:

~title~
<link rel='dns-prefetch' href='//www.test.com' ~title~
<link rel='dns-prefetch' href='//fonts.googleapis.com' ~title~
~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
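
The stray ~title~ at the end of every link line is consistent with how `|` binds: alternation has the lowest precedence in a regex, so the pattern reads as "`<title(.*?)</title>`, or else a bare `/>`" rather than "`<title`, then anything, then either ending". The small check below illustrates this; the one-line `text` sample is a hypothetical stand-in, not the full HTML above.

```python
import re

# Same shape as the expanded title pattern in the question.
pattern = re.compile(r'<title(.*?)</title>|/>')

# Hypothetical one-line sample, not the full HTML from the question.
text = "<title>T</title> <link href='//x' />"

# The second alternative is just '/>' on its own, so it matches
# the end of the <link ... /> tag as well.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['<title>T</title>', '/>']

replaced = pattern.sub('~title~', text)
print(replaced)  # ~title~ <link href='//x' ~title~
```

Because the title pass already consumes every `/>`, the later link pass has nothing left to match, which would explain why no ~link~ appears at all.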

I'm brand new to regex; any guidance would be greatly appreciated.
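
One possible direction, offered as a minimal sketch rather than a definitive fix: wrap the two endings in a non-capturing group so the alternation applies only to the closing part. The name `remove_tag`, the `\b` word boundary, the `re.DOTALL` flag, and the shortened `sample` string are assumptions made for illustration. A regex like this stops at whichever ending comes first, so it will cut a tag short if a `/>` appears inside it; a real HTML parser is usually more robust.

```python
import re

def remove_tag(html: str, label: str) -> str:
    """Replace '<label ...' up to the next '</label>' or '/>' with '~label~'."""
    name = re.escape(label)
    # (?: ... ) groups the two endings so '|' applies only to them.
    # \b keeps 'link' from also matching a longer tag name such as 'linker'.
    pattern = rf'<{name}\b.*?(?:</{name}>|/>)'
    # DOTALL lets '.' cross newlines in case a tag spans several lines.
    return re.sub(pattern, f'~{label}~', html, flags=re.DOTALL)

# Shortened, hypothetical sample in the same style as the question's HTML.
sample = """
    <title>Test1</title>
    <link rel='dns-prefetch' href='//www.test.com' />
    <title>Test2</title>
"""

for tag in ('title', 'link'):
    sample = remove_tag(sample, tag)

# Collapse the leftover whitespace so the placeholders sit on one line.
result = ' '.join(sample.split())
print(result)  # ~title~ ~link~ ~title~
```

The lazy `.*?` tries the closing group at each position before consuming more text, which is what gives the "whichever comes first" behaviour.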

How do I match from x to y OR z (whichever comes first) in a regular expression? (HTML parsing)
