Question

I expected to find this on SO already, but so far I haven't.

I'm talking about a regex that examines an HTML-encoded string, for example something like:

blip &#9830; trout&rsquo;s mouth

Have I covered all the bases with &\w+; and &#[0-9]+;?

$encoded_string = htmlspecialchars($_GET["searchterms"]);
echo "<b>Search results for submitted string: \"$encoded_string\"</b><br><br>";
$html_special_chars_pattern = "!(&\\w+;|&#[0-9]+;)!";
$non_html_tokens = preg_split( $html_special_chars_pattern, $encoded_string, -1, PREG_SPLIT_DELIM_CAPTURE );

Answer 1

You missed the &#xH; or &#XH; numeric character references.

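5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:

The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.

The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.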

That is, &#[xX][a-fA-F0-9]+; in regex terms.
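As a rough sketch of how the hexadecimal form could be folded into the split from the question: the sample string and the variable names `$pattern` and `$tokens` are assumptions made for illustration, and the named-reference branch simply reuses the question's `&\w+;`.

```php
<?php
// Hypothetical sample input: one decimal, one named, and one hexadecimal reference.
$encoded_string = 'blip &#9830; trout&rsquo;s mouth &#x2666;';

// Three alternatives: named references, decimal references,
// and hexadecimal references introduced by x or X.
$pattern = '!(&\w+;|&#[0-9]+;|&#[xX][a-fA-F0-9]+;)!';

// PREG_SPLIT_DELIM_CAPTURE keeps the matched references in the result
// alongside the text between them.
$tokens = preg_split($pattern, $encoded_string, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($tokens);
```

Because the input ends with a reference, the last element of `$tokens` is an empty string; adding `PREG_SPLIT_NO_EMPTY` to the flags would drop it.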

Answer 2

I'm posting my earlier related post as an answer. If anyone comes up with a better solution, or a reason why this would break, please let me know :)

preg_match_all('/&(?:[a-z]+|#\d+);/', $content, $matches);

To support hexadecimal entities as well:

// After #x the digits may include a-f, which \d alone does not match.
preg_match_all('/&(?:[a-z]+|#\d+|#x[\da-f]+);/i', $content, $matches);

Incidentally, (?: ... ) is a non-capturing group: it groups the alternatives without storing what they matched. See also: What does `?` mean in this Perl regex?
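To illustrate the difference a non-capturing group makes, here is a small sketch comparing the two forms with `preg_match_all`. The sample string and the variable names `$nonCapturing` and `$capturing` are assumptions, and the pattern is this sketch's own, with an explicit hexadecimal-digit branch.

```php
<?php
// Hypothetical sample input with a named, a decimal, and a hexadecimal reference.
$content = 'AT&amp;T &#169; 2010 &#xA9;';

// Non-capturing group: the result holds only the full matches, at index 0.
preg_match_all('/&(?:[a-z]+|#\d+|#x[\da-f]+);/i', $content, $nonCapturing);

// Capturing group: index 1 additionally stores the text inside the parentheses.
preg_match_all('/&([a-z]+|#\d+|#x[\da-f]+);/i', $content, $capturing);

print_r($nonCapturing);
print_r($capturing);
```

Both calls find the same three references; the capturing version just carries an extra sub-array that the caller may not need.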

PHP regex to identify all HTML special characters in a decoded string

2 answers: