Question

How do I write a regular expression that removes both encoded and unencoded words from text?

For example, assume the following:

$string1 = 'do not enter your username';
//The encoded string below is: 'or password';
$string2 = '&#111;&#114 &#112;&#97;&#115;&#115;&#119;&#111;&#114;&#100;';
$string = $string1 . $string2;

The regular expression should remove the unencoded word "username" and the encoded words "or password", which are encoded as follows:

&#111;&#114 &#112;&#97;&#115;&#115;&#119;&#111;&#114;&#100;

I wrote the following regular expression. It works for unencoded words but fails on encoded ones.

$words_to_remove = 'username|or password';
preg_replace("/\b($words_to_remove)\b/u",  ' ',  $string);

Answer 1

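More precisely, '&#111;&#114 &#112;&#97;&#115;&#115;&#119;&#111;&#114;&#100;' is numeric HTML encoding (numeric character references), so it has to be decoded rather than matched directly.
Also, there is a typo in the encoded string: &#114 <--- &#114; is the equivalent of the character r, and every such "sequence" should end with a semicolon (;). The final solution, using the html_entity_decode function, should look like this: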

$string1 = 'do not enter your username ';
$string2 = '&#111;&#114; &#112;&#97;&#115;&#115;&#119;&#111;&#114;&#100;';
$string = html_entity_decode($string1 . $string2);

$words_to_remove = 'username|password';
$string = preg_replace("/($words_to_remove)/u",  ' ',  $string);

print_r($string);

Output:

do not enter your   or
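
The answer above removes "username" and "password" but leaves "or" behind, and it relies on the missing semicolon being fixed by hand. As a minimal sketch (not part of the original answer), the same idea can be wrapped in a helper that first repairs decimal references lacking a trailing semicolon, then decodes, then removes whole phrases with word boundaries. The function name remove_words and the whitespace clean-up at the end are assumptions made for illustration, and the input keeps a space between "username" and the encoded text, as in the answer's version of $string1.

```php
<?php
// Hypothetical helper, not from the original answer: decode numeric
// character references, then remove the given words or phrases.
function remove_words(string $text, array $words): string
{
    // Append the terminating ';' to decimal references that lack one,
    // e.g. "&#114 " becomes "&#114; ". References that already end in
    // ';' are left alone by the negative lookahead.
    $text = preg_replace('/&#(\d+)(?![\d;])/', '&#$1;', $text);

    // Turn the references into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Escape each phrase so it is matched literally inside the pattern.
    $alternatives = implode('|', array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words));

    // Remove whole words/phrases only; \b keeps "passwords" intact.
    $text = preg_replace("/\b(?:$alternatives)\b/u", ' ', $text);

    // Collapse the whitespace left behind by the removals.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$string = 'do not enter your username &#111;&#114 &#112;&#97;&#115;&#115;&#119;&#111;&#114;&#100;';
echo remove_words($string, ['username', 'or password']), PHP_EOL;
// prints "do not enter your"
```

Decoding before matching keeps the word list in plain text, so the same list works for encoded and unencoded input. Note that with the question's original concatenation (no space after "username"), the decoded text would read "usernameor password" and the \b boundary would no longer match "username".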
