Question

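I am working on a lexical analyzer. I have an HTML file, and I want to convert every letter in it to uppercase, except for the letters written inside HTML tags. For example: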

<html>
    <body>
       StackOverFlow
    </body>
</html>

This would be converted to the following:

<html>
    <body>
       STACKOVERFLOW
    </body>
</html>

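I just want to know the regular expression that selects everything inside the HTML tags, so that I can leave those parts untouched.

Consider only simple HTML tags enclosed in < and >.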

Answer 1

Use either of the following versions:

(?<=<)[^<]+(?=>)

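Explanation: (?<=<) checks that a < comes immediately before the current position (without consuming it), [^<]+ matches one or more characters that are not an opening bracket (consuming them), and (?=>) checks that a > follows (again without consuming it). Since only the middle part is consumed, the match for <p> is p rather than <p>.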
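As a quick check of the lookaround version, here is a minimal Python sketch. Python is an assumption on my part (the question does not name a language), and the sample string is made up. It lists what the pattern matches, showing that the brackets themselves are left out.

```python
import re

# Lookaround version from above: the brackets are checked but not consumed.
INSIDE_TAG = re.compile(r"(?<=<)[^<]+(?=>)")

sample = "<p>hi</p>"  # made-up sample input
names = INSIDE_TAG.findall(sample)
print(names)  # ['p', '/p']
```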

Or, to match the tag together with its brackets:

<[^<]+>

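Explanation: < matches a single literal <, then [^<]+ matches one or more characters other than <, then > matches a single literal >. All of these characters are consumed, so the match looks like <p>.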
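To connect this back to the question, the following is a rough Python sketch of how the bracket-consuming pattern could drive the uppercase conversion. The function name `upper_outside_tags` is hypothetical, and the approach (copy each tag as-is, uppercase the text between tags) is just one way to do it. It assumes simple tags as in the question; text that contains a bare > outside a tag, such as `<b>a > b</b>`, would confuse this pattern.

```python
import re

# Bracket-consuming version from above: the whole tag, brackets included.
TAG = re.compile(r"<[^<]+>")

def upper_outside_tags(html: str) -> str:
    """Uppercase everything except the text matched as a tag."""
    parts = []
    pos = 0
    for match in TAG.finditer(html):
        parts.append(html[pos:match.start()].upper())  # text before the tag
        parts.append(match.group(0))                   # the tag, unchanged
        pos = match.end()
    parts.append(html[pos:].upper())                   # text after the last tag
    return "".join(parts)

source = "<html>\n    <body>\n       StackOverFlow\n    </body>\n</html>"
print(upper_outside_tags(source))
```

With the input from the question, the tags stay lowercase and StackOverFlow becomes STACKOVERFLOW.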

Answer 2

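Depending on where your input comes from and who your audience is, you may need to be more tolerant. Much as I dislike tag attributes without quotes, you will run into them. You will also run into unescaped brackets inside tags, such as value="4 > 3".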

(?<=<)([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*(?=>)

or

<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>

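Explanation of the first regex (the second is similar, but it actually consumes the brackets instead of checking for them with lookarounds). In the comments, LB = lookbehind, LA = lookahead, CG = capturing group, NCG = non-capturing group.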

 (?<=                              # Opens LB
     <                             # Literal <
 )                                 # Closes LB
 (                                 # Opens CG1
     [\w-]+                        # Character class (any of the characters within)
                                     # Token: \w (a-z, A-Z, 0-9, _)
                                     # Any of: -
                                     # + repeats one or more times
 )                                 # Closes CG1
 (                                 # Opens CG2
     (?:                           # Opens NCG
         \s+                       # Token: \s (white space)
         [\w-]+                    # Character class (any of the characters within)
                                     # Token: \w (a-z, A-Z, 0-9, _)
                                     # Any of: -
         \s*                       # Token: \s (white space)
                                     # * repeats zero or more times
         (?:                       # Opens NCG
             =                     # Literal =
             \s*                   # Token: \s (white space)
             (?:                   # Opens NCG
                 [^"'>\s]+         # Negated Character class (excludes the characters within)
                                     # None of: "'>
                                     # Token: \s (white space)
             |                     # Alternation (NCG)
                 (                 # Opens CG3
                     "             # Literal "
                 |                 # Alternation (CG3)
                     '             # Literal '
                 )                 # Closes CG3
                 .*?               # . denotes any single character, except for newline
                                     # * repeats zero or more times
                                     # ? as few times as possible
                 \3                # A backreference to CG3
                                     # This is a repeat of the matched text, not of the pattern.
                                     # If this is an Octal Escape try padding with 0s like \003.
             )                     # Closes NCG
         )?                        # Closes NCG
                                     # ? repeats zero or one times
     )*                            # Closes NCG
 )                                 # Closes CG2
 \s*                               # Token: \s (white space)
 (?=                               # Opens LA
     >                             # Literal >
 )                                 # Closes LA
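
For illustration, here is a small Python sketch that runs the second (bracket-consuming) pattern on a tag with an unquoted attribute and a > inside a quoted value. The sample input is made up, and Python is an assumption rather than something the question specifies.

```python
import re

# Second pattern from this answer, unchanged.
# Group 1 is the tag name, group 2 the raw attribute text.
TAG = re.compile(
    r"""<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>"""
)

sample = '<input type=text value="4 > 3"> is true'  # made-up sample input
m = TAG.search(sample)
print(m.group(0))  # <input type=text value="4 > 3">
print(m.group(1))  # input
print(m.group(2))
```

Note that, as written, the pattern expects a tag name directly after <, so a closing tag such as </p> is not matched by it.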
