PHP regular expressions and preg_replace question

Time: 2013-03-19 23:27:01

Tags: php regex

I'm looking at someone else's old code and can't make sense of it.

He has:

explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts)))));

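I think the first regex excludes a-z0-9, but I'm not sure what the second regex does. The third matches anything inside '< >' except '>'.

The result is an array containing each word in the $texts variable, but I just don't know how the code produces it. I do understand what preg_replace and the other functions do; I just don't know how the process works.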

2 answers:

Answer 0: (score: 4)

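The expression /[^a-z0-9-]+/i matches (and then replaces with a space) any run of characters other than a-z, 0-9, and the hyphen. The ^ at the start of [^...] negates the character set that follows.

  • [^a-z0-9-] matches any single character that is not a letter, a digit, or a hyphen
  • + means one or more of the preceding
  • /i makes it case-insensitive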
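
To see that negated class in action, here is a minimal sketch. The sample string is made up for illustration and is not from the original code.

```php
<?php
// Made-up sample input, chosen to include punctuation, an apostrophe,
// and a hyphenated word.
$sample = "Hello, World! It's a well-known fact: 42 > 7.";

// Each run of characters other than letters, digits, or "-" collapses
// into a single space; the hyphen in "well-known" survives.
$cleaned = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

echo $cleaned, PHP_EOL;
```

Note that the apostrophe in "It's" is replaced as well, so that word is split into "It" and "s", and the final period leaves a trailing space.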

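The expression /\&#?[a-z0-9]{2,4}\;/ matches an &, optionally followed by #, followed by two to four lowercase letters or digits, and ending with ;. This will match HTML entities like &nbsp; or &#39;.

  • &#? matches either & or &#, because the ? makes the preceding # optional. The & doesn't actually need to be escaped.
  • [a-z0-9]{2,4} matches between two and four alphanumeric characters (lowercase only, since this pattern has no /i flag)
  • ; is a literal semicolon. It doesn't actually need to be escaped either.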
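
A quick sketch of what this pattern does and does not catch. The sample string is invented for the example; the point is that the {2,4} limit and the missing /i flag let longer or uppercase entities through.

```php
<?php
// Made-up sample: two short entities that match, and three that do not.
$sample = 'a&nbsp;b &#39;c&#39; &hellip; &AMP; &#10004;';

// Same entity pattern as in the question, without the unneeded escapes.
$cleaned = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $sample);

// &nbsp; and &#39; are replaced by spaces.
// &hellip; (six letters), &AMP; (uppercase), and &#10004; (five digits)
// are left untouched by this pattern.
echo $cleaned, PHP_EOL;
```

Anything left over is not lost entirely: the later /[^a-z0-9-]+/i pass still removes the & and ;, but the entity name itself (for example "hellip") would then remain as a word.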

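As you suspected, the last one replaces any tag such as <tagname>, <tagname attr='value'>, or </tagname> with a space. Note that it matches the whole tag, not just the contents between the < and >.

  • < is a literal character
  • [^>]+ is every character up to, but not including, the next >
  • > is a literal character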
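
As a small illustration, the snippet below runs the tag pattern on a made-up fragment and compares it with PHP's built-in strip_tags(). The main visible difference in this case is that the regex leaves a space where each tag was.

```php
<?php
// Made-up HTML fragment for illustration.
$sample = '<p class="intro">Hello <b>world</b></p>';

// The regex replaces every whole tag, attributes included, with one space.
$spaced = preg_replace('/<[^>]+>/', ' ', $sample);

// strip_tags() removes the tags without inserting anything.
$stripped = strip_tags($sample);

var_dump($spaced);   // the text padded with spaces where the tags were
var_dump($stripped); // just the text between the tags
```

The inserted spaces matter for the original code: they keep words in adjacent elements from being glued together before the final explode().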

I'd really recommend rewriting this as three separate calls to preg_replace() instead of nesting them.

// Strips tags.
// Would be better done with strip_tags()!!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Removes remaining characters other than letters, digits, and hyphens
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercases, as the original did with strtolower(), then splits on spaces
$array = explode(' ', strtolower($texts));
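
One side effect worth knowing about: because replacements can leave a space at the start or end of the string, explode(' ', ...) may return empty strings alongside the real words. The sketch below is one possible way to avoid that, using preg_split() with PREG_SPLIT_NO_EMPTY. The function name extract_words and the sample input are hypothetical, and this is a variation on the original code, not what it actually does.

```php
<?php
// Hypothetical helper name; not part of the original code.
function extract_words(string $texts): array
{
    $texts = preg_replace('/<[^>]+>/', ' ', $texts);           // tags
    $texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts); // entities
    $texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);      // everything else

    // Split on runs of spaces and drop empty pieces.
    return preg_split('/ +/', strtolower($texts), -1, PREG_SPLIT_NO_EMPTY);
}

// Made-up sample input.
$words = extract_words('<p>Hello,&nbsp;<b>World</b>!</p>');

// For comparison: plain explode() on the same cleaned, lowercased string.
$withEmpties = explode(' ', ' hello world ');

print_r($words);
print_r($withEmpties);
```

With this input the preg_split() version yields only the two words, while explode() also returns an empty string at each end.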

Answer 1: (score: 2)

This code looks like it...

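  1. Strips HTML/XML tags (anything between < and >)
  2. Then strips anything that starts with & or &#, is followed by 2-4 alphanumeric characters, and ends with ;
  3. Then strips anything that is not alphanumeric or a dash
  4. The steps run in that order because the nested calls are evaluated innermost first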
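
To show why that order matters, here is a small sketch that runs the same three patterns in the original order and in a reversed order. The reversed version is hypothetical and exists only for contrast; the sample string is made up.

```php
<?php
// Made-up sample input.
$texts = '<b>Tom&nbsp;Jones</b>';

// Original order: innermost call (tags) runs first, then entities,
// then all remaining non-alphanumeric characters.
$inOrder = preg_replace('/[^a-z0-9-]+/i', ' ',
    preg_replace('/&#?[a-z0-9]{2,4};/', ' ',
        preg_replace('/<[^>]+>/', ' ', $texts)));

// Hypothetical reversed order, for contrast only: stripping punctuation
// first destroys the <, >, & and ; that the other two patterns rely on.
$reversed = preg_replace('/<[^>]+>/', ' ',
    preg_replace('/&#?[a-z0-9]{2,4};/', ' ',
        preg_replace('/[^a-z0-9-]+/i', ' ', $texts)));

var_dump($inOrder);
var_dump($reversed);
```

In the reversed run the tag name "b" and the entity name "nbsp" end up in the output as if they were words, which is why the tag and entity passes have to come first.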

    /<[^>]+>/
    
    Match the character “<” literally «<»
    Match any character that is NOT a “>” «[^>]+»
       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    Match the character “>” literally «>»
    
    
    /\&#?[a-z0-9]{2,4}\;/
    
    Match the character “&” literally «\&»
    Match the character “#” literally «#?»
       Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
    Match a single character present in the list below «[a-z0-9]{2,4}»
       Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
       A character in the range between “a” and “z” «a-z»
       A character in the range between “0” and “9” «0-9»
    Match the character “;” literally «\;»
    
    
    /[^a-z0-9-]+/i
    
    Options: case insensitive
    
    Match a single character NOT present in the list below «[^a-z0-9-]+»
       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
       A character in the range between “a” and “z” «a-z»
       A character in the range between “0” and “9” «0-9»
       The character “-” «-»