I'm looking at someone else's old code and can't make sense of it.
He has:
explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts)))));
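I think the first regex excludes a-z and 0-9, but I'm not sure what the second regex does. The third matches anything inside '< >' except '>'.
The result is an array containing every word in the $texts variable, but I just don't see how the code produces it. I do understand what preg_replace and the other functions do; I just don't understand how the process works.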
Answer 0 (score: 4)
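The expression /[^a-z0-9-]+/i matches (and the code then replaces with a space) any run of characters other than a-z, 0-9 and the hyphen. The ^ at the start of [^...] negates the character set inside the brackets.
- [^a-z0-9-] matches any single character that is not a letter, a digit or a hyphen
- + means one or more of the preceding
- /i makes the match case-insensitive

The expression /\&#?[a-z0-9]{2,4}\;/ matches an &, optionally followed by a #, followed by two to four letters or digits, and ending with a ;. This matches HTML entities such as &nbsp; or &#39;.
- &#? matches either & or &#, because the ? makes the preceding # optional. The & doesn't actually need to be escaped.
- [a-z0-9]{2,4} matches between two and four characters, each a lowercase letter or a digit (this pattern has no /i flag)
- ; is a literal semicolon. It doesn't actually need to be escaped either.

As you suspected, the last one replaces any tag such as <tagname>, <tagname attr='value'> or </tagname> with a space. Because the calls are nested, this innermost one runs first. Note that it matches the whole tag, not just the contents between < and >.
- < is a literal character
- [^>]+ is every character up to, but not including, the next >
- > is a literal character

I'd really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.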
// Strips tags.
// Would be better done with strip_tags()!!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
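// Replaces each remaining run of characters that are not letters, digits or hyphens
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercase before splitting, as the original one-liner does
$array = explode(' ', strtolower($texts));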
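To see how the three replacements build on each other, here is a minimal sketch that keeps every intermediate value. The sample string and the variable names ($step1, $step2, $step3, $words, $clean) are made up for illustration and are not from the original code. The last line is an optional variation, not what the original does.

```php
<?php
// Hypothetical sample input, not from the original post.
$texts = '<p>Hello&nbsp;World, it&#39;s re-usable!</p>';

// Innermost call of the one-liner runs first: every tag becomes a space.
$step1 = preg_replace('/<[^>]+>/', ' ', $texts);
// ' Hello&nbsp;World, it&#39;s re-usable! '

// Next, short HTML entities become spaces.
$step2 = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $step1);
// ' Hello World, it s re-usable! '

// Finally, each run of anything that is not a letter, digit or hyphen
// collapses into a single space.
$step3 = preg_replace('/[^a-z0-9-]+/i', ' ', $step2);
// ' Hello World it s re-usable '

// Same split as the original: leading/trailing spaces yield empty strings.
$words = explode(' ', strtolower($step3));
// ['', 'hello', 'world', 'it', 's', 're-usable', '']

// Optional variation (an assumption about what you want): drop the empties.
$clean = preg_split('/ +/', strtolower($step3), -1, PREG_SPLIT_NO_EMPTY);
// ['hello', 'world', 'it', 's', 're-usable']
```

Note that the apostrophe entity splits "it's" into "it" and "s", and that the plain explode() keeps empty strings at both ends. Whether either matters depends on how the array is used afterwards.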
Answer 1 (score: 2)
This code looks like...
Order in which the nested calls are processed:
/<[^>]+>/
Match the character “<” literally «<»
Match any character that is NOT a “>” «[^>]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “>” literally «>»
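A quick way to check the breakdown above is to run the pattern on a few strings. This is only a sketch; the inputs and the variable names ($a, $b, $c) are invented for illustration.

```php
<?php
$pattern = '/<[^>]+>/';

// Opening tag (attributes included) and closing tag are each replaced by one space.
$a = preg_replace($pattern, ' ', '<a href="x.html">link</a> text');
// ' link  text'

// "<>" is not matched: + needs at least one character between < and >.
$b = preg_match($pattern, '<>');
// 0

// A ">" inside an attribute value ends the match early and leaves debris behind.
$c = preg_replace($pattern, ' ', '<img alt="a>b">');
// ' b">'
```

The last case is one reason a dedicated function such as strip_tags() tends to be a safer choice than this regex for real HTML.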
/\&#?[a-z0-9]{2,4}\;/
Match the character “&” literally «\&»
Match the character “#” literally «#?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character present in the list below «[a-z0-9]{2,4}»
   Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
Match the character “;” literally «\;»
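The {2,4} quantifier and the missing case-insensitive option limit which entities this pattern catches. A small sketch, with a hand-picked list of sample entities (my choice, not from the original code), shows where the boundary falls:

```php
<?php
$pattern = '/\&#?[a-z0-9]{2,4}\;/';

// Hypothetical samples chosen to probe the limits of the pattern.
$samples = ['&nbsp;', '&#39;', '&#x27;', '&amp;', '&middot;', '&NBSP;', '&#128512;'];

$matched = [];
foreach ($samples as $s) {
    // preg_match() returns 1 on a match, 0 otherwise.
    $matched[$s] = preg_match($pattern, $s) === 1;
}

foreach ($matched as $entity => $ok) {
    echo str_pad($entity, 12), $ok ? 'matched' : 'not matched', "\n";
}
```

The first four match. &middot; and &#128512; have more than four characters after the & or &#, and &NBSP; is uppercase, so those three pass through this step untouched and are only broken up by the later non-alphanumeric replacement.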
/[^a-z0-9-]+/i
Options: case insensitive
Match a single character NOT present in the list below «[^a-z0-9-]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
   The character “-” «-»
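As a rough illustration of the breakdown above, the following sketch runs the negated class on a few invented strings. The variable names ($a, $b, $c) are hypothetical.

```php
<?php
$pattern = '/[^a-z0-9-]+/i';

// Hyphens and letters of either case survive; ", " and "!!" each collapse to one space.
$a = preg_replace($pattern, ' ', 'Re-usable, CODE!!');
// 'Re-usable CODE '

// Only ASCII letters are in the class, so an accented letter is replaced too.
$b = preg_replace($pattern, ' ', 'café au lait');
// 'caf au lait'

// Without the /i option, uppercase letters would be treated as junk as well.
$c = preg_replace('/[^a-z0-9-]+/', ' ', 'ABC');
// ' '
```

In the original one-liner the /i option matters because strtolower() is applied only after this replacement.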