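PCRE: lazy and greedy at the same time (possessive quantifiers)

Time: 2010-02-26 20:49:50

Tags: php preg-match-all pcre

I am trying to match a series of text strings with PCRE in PHP, and I am having trouble getting all of the matches that lie between the first and the second.

If anyone wonders why on Earth I would want to do this, it is because of Doc Comments. Oh, how I wish Zend would provide native/plugin functions to read Doc Comments from a PHP file...

The following (plain) example text will be used for the problem. It will always be pure PHP code, with a single opening tag at the start of the file and no closing tag. You can assume the syntax is always correct.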

<?php
  class someClass extends someExample
  {
    function doSomething($someArg = 'someValue')
    {
      // Nested code blocks...
      if($boolTest){}
    }
    private function killFurbies(){}
    protected function runSomething(){}
  }

  abstract
  class anotherClass
  {
    public function __construct(){}
    abstract function saveTheWhales();
  }

  function globalFunc(){}

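The problem

I am trying to match all of the methods within a class; my regex never finds the method killFurbies(). Making it greedy means it only matches the last method in a class, and making it lazy means it only matches the first method.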

$part = '.*';  // Greedy
$part = '.*?'; // Lazy

$regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
       . '.*?\{' . $part .'(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
       . 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
       . ']*)(?:\\n|\\r|\\s)*\\(%ms';

preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
var_dump($matches);

Results:

// Lazy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(11) "doSomething"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(6) "public"
    [3]=>
    string(11) "__construct"
  }
}

// Greedy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
}

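How do I match them all? :S

Any help would be greatly appreciated; I already feel this question is ridiculous as I type it out. Anyone who attempts to answer a question like this is braver than I am!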
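A note on why neither variant can work: each match fills every capture group once, so one pass over a class can report at most one method name, whether `$part` is greedy or lazy. A minimal sketch of a two-pass alternative follows, assuming braces never appear inside strings or comments. The sample input and the names `$classRegex`, `$methodRegex` and `$methods` are made up for illustration; the recursive `(?2)` group is standard PCRE.

```php
<?php
// Hypothetical sample input, shortened from the example in the question.
$source = <<<'PHP'
class someClass extends someExample
{
  function doSomething($someArg = 'someValue')
  {
    if($boolTest){}
  }
  private function killFurbies(){}
  protected function runSomething(){}
}
PHP;

$name = '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*';

// Pass 1: class name (group 1) plus its brace-balanced body (group 2).
// (?2) recurses into group 2, so nested { } pairs are skipped correctly.
$classRegex = '%\bclass\s+(' . $name . ')[^{]*(\{(?:[^{}]++|(?2))*\})%';

// Pass 2: every method declaration inside one class body.
$methodRegex = '%(?:(public|protected|private)\s+)?function\s+(' . $name . ')\s*\(%';

$methods = array();
preg_match_all($classRegex, $source, $classes, PREG_SET_ORDER);
foreach ($classes as $class) {
    preg_match_all($methodRegex, $class[2], $found, PREG_SET_ORDER);
    foreach ($found as $m) {
        $methods[$class[1]][] = $m[2];
    }
}

print_r($methods);
```

For the sample above this collects doSomething, killFurbies and runSomething under someClass. It is still a text-level approximation, which is why the answers below recommend the tokenizer.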

3 answers:

Answer 0: (score: 0)

It is better to use token_get_all to get the tokens of the PHP code and iterate over them. PHPDoc-style comments can be identified by the T_DOC_COMMENT token type.
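As a rough illustration of that suggestion, the sketch below collects every doc comment in a source string. The sample source and the variable names are invented for the example; `token_get_all` and `T_DOC_COMMENT` are part of PHP's tokenizer extension.

```php
<?php
// Hypothetical sample source; in practice this would come from file_get_contents().
$source = <<<'PHP'
<?php
/** Example class. */
class someClass
{
  /** Does something. */
  public function doSomething(){}
  // Not a doc comment.
  private function killFurbies(){}
}
PHP;

$docComments = array();
foreach (token_get_all($source) as $token) {
    // Single-character tokens such as "{" are plain strings; all other tokens
    // are arrays of (token id, text, line number).
    if (is_array($token) && $token[0] === T_DOC_COMMENT) {
        $docComments[] = $token[1];
    }
}

print_r($docComments);
```

Only the two `/** ... */` blocks end up in `$docComments`; the `//` comment is tokenized as T_COMMENT and is skipped.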

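Answer 1: (score: 0)

Erm, can't you just parse the source with token_get_all and look for tokens of type T_DOC_COMMENT? (Changed from T_COMMENT to T_DOC_COMMENT; see Gumbo's post.)

An example of how to use the token_get_all function can be found here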
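To go one step further and attach each doc comment to the function it documents, a small state machine over the token stream is usually enough. This is only a sketch: the helper name `doc_comments_by_function` is hypothetical, and it assumes a doc comment belongs to the next named function unless a `;`, `{` or `}` comes first.

```php
<?php
// Hypothetical helper: maps function/method names to their doc comment, or false.
function doc_comments_by_function($source)
{
    $docs = array();
    $pendingDoc = false; // last doc comment not yet attached to a declaration
    $expectName = false; // true right after a T_FUNCTION token

    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Punctuation ends a declaration, so drop any dangling doc comment.
            if ($token === ';' || $token === '{' || $token === '}') {
                $pendingDoc = false;
            }
            // "(" directly after "function" means a closure: no name follows.
            if ($token === '(') {
                $expectName = false;
            }
            continue;
        }
        if ($token[0] === T_DOC_COMMENT) {
            $pendingDoc = $token[1];
        } elseif ($token[0] === T_FUNCTION) {
            $expectName = true;
        } elseif ($expectName && $token[0] === T_STRING) {
            $docs[$token[1]] = $pendingDoc;
            $pendingDoc = false;
            $expectName = false;
        }
    }
    return $docs;
}

$source = "<?php\n/** Example class. */\nclass someClass\n{\n"
        . "  /** Does something. */\n  public function doSomething(){}\n"
        . "  private function killFurbies(){}\n}\n";
$docs = doc_comments_by_function($source);
print_r($docs);
```

The class-level comment is discarded when the class's opening brace is reached, so it is not mistaken for the first method's documentation.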

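Answer 2: (score: 0)

Solution

I came up with a class that extracts the Doc Comments of the classes and methods in a file. Thanks to everyone who answered this question, and to the other question on matching code blocks.

The average benchmark for the following example is between 0.00495 and 0.00505.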

<?php

$file = 'path/to/libraries/tokenizer.php';
include $file;
$tokenizer = new Tokenizer;
// Start Benchmarking here.
$tokenizer->load($file);
// End Benchmarking here.
// The following will output 'bool(false)'.
var_dump($tokenizer->get_doc('Tokenizer', 'get_tokens'));
// The following will output 'string(18) "/** load method */"'.
var_dump($tokenizer->get_doc('Tokenizer', 'load'));
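
The "Start/End Benchmarking" markers above do not show how the timing was taken. One plausible way, sketched here with `microtime(true)` and a hypothetical helper `average_seconds`, is to average many runs; the tokenized string is a stand-in for the real `$tokenizer->load($file)` call.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $callback.
function average_seconds($callback, $runs = 100)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        call_user_func($callback);
    }
    return (microtime(true) - $start) / $runs;
}

// Stand-in workload; replace the body with $tokenizer->load($file) to time the class.
$avg = average_seconds(function () {
    token_get_all('<?php class A { function b(){} }');
});

printf("%.5f seconds per run\n", $avg);
```

Averaging over repeated runs smooths out disk caching and other noise, which matters for timings in the millisecond range.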

The Tokenizer class (yes, I haven't thought of a better name yet...):

<?php

class Tokenizer
{

  private $compiled = false, $path = false, $tokens = false, $classes = array();

  /** load method */
  public function load($path)
  {
    $path = realpath($path);
    if(!file_exists($path) || !function_exists('token_get_all'))
    {
      return false;
    }
    $this->compiled = false;
    $this->classes = array();
    $this->path = $path;
    $this->tokens = false;

    $this->get_tokens();
    $this->get_classes();
    $this->class_blocks();
    $this->class_functions();
    return true;
  }

  protected function get_tokens()
  {
    $tokens = token_get_all(file_get_contents($this->path));
    $compiled = '';
    foreach($tokens as $k => $t)
    {
      if(is_array($t) && $t[0] != T_WHITESPACE)
      {
        $compiled .= $k . ':' . $t[0] . ',';
      }
      else
      {
        if($t == '{' || $t == '}')
        {
          $compiled .= $t . ',';
        }
      }
    }
    $this->tokens = $tokens;
    $this->compiled = trim($compiled, ',');
  }

  protected function get_classes()
  {
    if(!$this->compiled)
    {
      return false;
    }
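    // The numbers are the integer values of T_* token constants in the PHP
    // version this was written against. These values change between PHP
    // versions, so check them with token_name() before reusing the pattern.
    $regex = '%(?:(\\d+)\\:366,)?(?:\\d+\\:(?:345|344|353),)?\\d+\\:352,(\\d+)\\:307,(?:\\d+\\:(?:354|355),\\d+\\:307,)*{%';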
    preg_match_all($regex, $this->compiled, $classes, PREG_SET_ORDER);
    if(is_array($classes))
    {
      foreach($classes as $class)
      {
        $this->classes[$this->tokens[$class[2]][1]] = array('token' => $class[2]);
        $this->classes[$this->tokens[$class[2]][1]]['doc'] = isset($this->tokens[$class[1]][1]) ? $this->tokens[$class[1]][1] : false;
      }
    }
  }

  private function class_blocks()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
      $this->classes[$class_name]['block'] = $this->get_block($class['token']);
    }
  }

  protected function get_block($name_token)
  {
    if(!$this->compiled || ($pos = strpos($this->compiled, $name_token . ':')) === false)
    {
      return false;
    }
    $section= substr($this->compiled, $pos);
    $len = strlen($section);
    $block = '';
    $opening = 0; // Start at 0 so the loop stops at this block's own closing brace.
    $closing = 0;
    for($i = 0; $i < $len; $i++)
    {
      if($section[$i] == '{')
      {
        $opening++;
      }
      elseif($section[$i] == '}')
      {
        $closing++;
        if($closing == $opening)
        {
          break;
        }
      }
      if($opening > 0)
      {
        $block .= $section[$i];
      }
    }
    return trim($block, ',');
  }

  protected function class_functions()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
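      // Parameter tokens (e.g. variables) sit between the name and '{' in the
      // compiled string, so skip any tokens there except another 333 token.
      $regex = '%(?:(\d+)\:366,)?(?:\d+\:(?:344|345),)?(?:\d+\:(?:341|342|343),)?\d+\:333,(\d+)\:307,(?:\d+\:(?!333,)\d+,)*\{%';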
      preg_match_all($regex, $class['block'], $functions, PREG_SET_ORDER);
      foreach($functions as $function)
      {
        $function_name = $this->tokens[$function[2]][1];
        $this->classes[$class_name]['functions'][$function_name] = array('token' => $function[2]);
        $this->classes[$class_name]['functions'][$function_name]['doc'] = isset($this->tokens[$function[1]][1]) ? $this->tokens[$function[1]][1] : false;
        $this->classes[$class_name]['functions'][$function_name]['block'] = $this->get_block($function[2]);
      }
    }
  }

  public function get_doc($class, $function = false)
  {
    if(!is_string($class) || !isset($this->classes[$class]))
    {
      return false;
    }
    if(!is_string($function))
    {
      return $this->classes[$class]['doc'];
    }
    else
    {
      if(!isset($this->classes[$class]['functions'][$function]))
      {
        return false;
      }
      return $this->classes[$class]['functions'][$function]['doc'];
    }
  }

}
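
Since token IDs are not stable across PHP versions, the same compiled-string idea can be written with the T_* constants instead of literal numbers. The sketch below is a standalone approximation of `get_tokens()` plus `class_functions()`, not the class itself: the sample source, the variable names and the choice of modifier constants are assumptions, because the post does not say which constants the original numbers stand for.

```php
<?php
// Hypothetical sample source with one documented and one undocumented method.
$source = "<?php\nclass A\n{\n  /** load method */\n  public function load(){}\n"
        . "  protected function get_tokens(){}\n}\n";
$tokens = token_get_all($source);

// Same compact "index:id" encoding as Tokenizer::get_tokens().
$parts = array();
foreach ($tokens as $k => $t) {
    if (is_array($t)) {
        if ($t[0] != T_WHITESPACE) {
            $parts[] = $k . ':' . $t[0];
        }
    } elseif ($t == '{' || $t == '}') {
        $parts[] = $t;
    }
}
$compiled = implode(',', $parts);

// Build the pattern from constants so it follows the running PHP version.
$modifiers = implode('|', array(T_ABSTRACT, T_FINAL, T_STATIC, T_PUBLIC, T_PROTECTED, T_PRIVATE));
$regex = '%(?:(\d+):' . T_DOC_COMMENT . ',)?(?:\d+:(?:' . $modifiers . '),)*'
       . '\d+:' . T_FUNCTION . ',(\d+):' . T_STRING . ',\{%';

preg_match_all($regex, $compiled, $functions, PREG_SET_ORDER);

$docs = array();
foreach ($functions as $f) {
    // Group 1 is an empty string when no doc comment precedes the declaration.
    $docs[$tokens[$f[2]][1]] = $f[1] !== '' ? $tokens[$f[1]][1] : false;
}

print_r($docs);
```

This simplified pattern expects `{` directly after the function name in the compiled string, so it only covers methods without parameter tokens; it is meant to show the constant-based construction rather than to replace the class.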

Any thoughts or comments on this? All criticism is welcome!

Thanks, mniz.