Question

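I'm using a regular expression to parse HTML node text, looking for words to act on. I'm using (\w+).

I have situations like word&nbsp;word, and the nbsp gets recognized as a word.

I can match HTML entities with \&[a-z0-9A-Z]+\; but I don't know how to un-match a word when it is part of an entity.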

Is there a way to have a regex match a word, but not when it is part of an HTML entity like the following?

&nbsp;
&lt; <
&#253; ý
etc.

Answer 1

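(?<!&#?)\b\w+ matches a word only if it is not preceded by & or &#. It does not check for the trailing semicolon, however, because a semicolon can legitimately follow a normal word.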
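A minimal C# sketch of how this pattern might be applied. The class name EntityAwareWords, the helper FindWords, and the sample string are made up for illustration; only the pattern comes from the answer. It relies on .NET allowing a variable-length lookbehind such as (?<!&#?).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityAwareWords
{
    // Pattern from the answer: a word that does not directly follow "&" or "&#".
    private static readonly Regex WordNotAfterAmpersand = new Regex(@"(?<!&#?)\b\w+");

    // Hypothetical helper: returns every word that is not an entity name or number.
    public static List<string> FindWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in WordNotAfterAmpersand.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string sample = "word&nbsp;word &lt; &#253; etc";
        // Prints: word, word, etc
        Console.WriteLine(string.Join(", ", FindWords(sample)));
    }
}
```

One limitation of this approach: a word that follows a bare ampersand, such as the second T in AT&T, is skipped as well, because the lookbehind only looks at what precedes the word.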

Answer 2

First, use:

System.Web.HttpUtility.HtmlDecode(...)

or

System.Net.WebUtility.HtmlDecode(...)

on your HTML.

Decoding converts all escaped characters to their normal display form. After that, parse the decoded HTML with your regex.
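A rough C# sketch of the decode-then-match approach, assuming the input is the text of a single node rather than a whole document. The class name DecodeThenMatch, the helper FindWords, and the sample string are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodeThenMatch
{
    // Hypothetical helper: decode entities first, then collect plain \w+ words.
    public static List<string> FindWords(string nodeText)
    {
        // HtmlDecode turns "&nbsp;" into U+00A0, "&amp;" into "&", and so on.
        string decoded = WebUtility.HtmlDecode(nodeText);

        var words = new List<string>();
        foreach (Match m in Regex.Matches(decoded, @"\w+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string sample = "word&nbsp;word &amp; more";
        // Prints: word, word, more
        Console.WriteLine(string.Join(", ", FindWords(sample)));
    }
}
```

Note that match positions then refer to the decoded text, not the original markup, and that a decoded letter becomes part of a word: caf&eacute; decodes to café and is matched as a single word.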

Answer 3

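Since you are using C#, you can go a step further and check for the full entity form.

This uses a conditional at the word boundary to check for a following semicolon. If one is there, it uses a lookbehind to make sure the word is not part of an entity.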

 # @"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))"

 (?i)
 ( \w+ )                       # (1)
 \b 
 (?(?= ; )                     # Conditional. Is ';' the next character ? 
      (?<!                          # Yes, then this word cannot be part of an entity
           (?: & | % )
           (?:
                [a-z]+ 
             |  (?:
                     \#
                     (?:
                          [0-9]+ 
                       |  x [0-9a-f]+ 
                     )
                )
           )
           (?= ; )
      )
 )

Code:

using System;
using System.Text.RegularExpressions;

string input = @"
&nbsp;
&lt; <
&#253; ý
etc etc
I have situations like word&nbsp;word and the nbsp gets recognized as a word.
";

// Same pattern as the expanded version above, written on one line.
Regex RxNonEntWords = new Regex(@"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))");
Match _m = RxNonEntWords.Match( input );
while (_m.Success)
{
    // Group 1 holds the matched word.
    Console.WriteLine("Found: {0}", _m.Groups[1].Value);
    _m = _m.NextMatch();
}

Regex to match words but not HTML entities
