Extracting hashtags - which is faster, preg_match_all() or a strpos() loop?

Time: 2014-04-09 17:27:09

Tags: regex performance preg-match-all hashtag strpos

I have written a basic function to extract hashtags ('#hashtag') from a string. The function is based on a strpos() loop.

I think there is more potential for optimization, and it would be great if you could show how to make it faster.

Compared with the alphanumeric preg_match_all() pattern, it is not much faster. I tested both on a string containing 350000 tags surrounded by some text.

Results:

350000 Tags extracted in 0.89877200126648 seconds // preg_match_all

350000 Tags extracted in 0.61978793144226 seconds // strpos loop
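
For anyone who wants to reproduce numbers like these, here is a minimal timing sketch. It is not the harness used for the results above: build_sample_text() and time_extraction() are hypothetical names, the filler text is made up, and the sample is kept small (1000 tags) so it runs quickly.

```php
// Hypothetical helper: builds a test string with $count hashtags,
// each surrounded by some filler text.
function build_sample_text(int $count): string
{
    $parts = [];
    for ($i = 0; $i < $count; $i++) {
        $parts[] = "some text #tag{$i} more text";
    }
    return implode(' ', $parts);
}

// Hypothetical helper: runs $extractor once and returns
// [number of extracted tags, elapsed seconds].
function time_extraction(callable $extractor, string $txt): array
{
    $start   = microtime(true);
    $tags    = $extractor($txt);
    $elapsed = microtime(true) - $start;
    return [count($tags), $elapsed];
}

$txt = build_sample_text(1000);

// Time the regex variant from the question; the strpos() loop can be
// wrapped in a closure and passed in the same way.
[$n, $seconds] = time_extraction(function (string $txt): array {
    preg_match_all('/(^|\s)\#\w+/', $txt, $r, PREG_OFFSET_CAPTURE);
    return $r[0];
}, $txt);

printf("%d Tags extracted in %f seconds // preg_match_all\n", $n, $seconds);
```

A single run is noisy, so repeating each measurement several times and comparing the fastest or median run is probably more meaningful than one pair of numbers.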

Functions used:

preg_match_all():
$pattern = '/(^|\s)\#\w+/';
preg_match_all( $pattern, $txt, $r, PREG_OFFSET_CAPTURE );
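
One thing to keep in mind with this pattern: the group (^|\s) is part of the match, so every tag that is not at the start of the text comes back with its leading whitespace, and the offset reported by PREG_OFFSET_CAPTURE points at that whitespace rather than at the '#'. A possible variant, sketched here as an assumption and not benchmarked, uses a lookbehind instead. extract_hashtags_regex() is a hypothetical name.

```php
// Hypothetical wrapper around preg_match_all(); returns each tag together
// with its byte offset in $txt.
function extract_hashtags_regex(string $txt): array
{
    // (?<!\S) asserts that '#' is at the start of the text or preceded by
    // whitespace, without making that whitespace part of the match.
    preg_match_all('/(?<!\S)#\w+/', $txt, $r, PREG_OFFSET_CAPTURE);

    $tags = [];
    // With PREG_OFFSET_CAPTURE every entry of $r[0] is [matched text, offset].
    foreach ($r[0] as [$tag, $offset]) {
        $tags[] = ['tag' => $tag, 'offset' => $offset];
    }
    return $tags;
}

// 'line#notatag' is skipped because the '#' follows a letter.
$found = extract_hashtags_regex('#first text line#notatag and #second');
```

Like the original pattern, this only accepts whitespace before the '#', so it is stricter than the prefix list used in the strpos() loop.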

strpos() Loop:

const _MAX_NUM_HASHTAGS_    = 10;
const _MAX_NUM_USERADDS_    = 10;
const _MIN_LENGTH_HASHTAGS_ = 3;   // including '#'
const _MAX_LENGTH_HASHTAGS_ = 30;  // including '#'

...

$txt                = $this->_postText;
$hash_tags          = array();
$stop               = false;
$hash_tag_prefix    = '#';
$hash_tag_suffix    = ' ';
$hash_tag_preffixes = array( ' ', ',', '!', '.', '?', '-', '_' );
$i                  = 0;
$end_pos            = 0;

while ( false === $stop ) {

    $i++;
    #if( $i === self::_MAX_NUM_HASHTAGS_ ){ $stop = true; }

    // if the tag is not at the beginning of our text
    // we need to validate that the tag is not part of
    // a normal string like /linksource.de?tag=tag#anchor
    // or textline#notatag
    $start_pos = strpos( $txt, $hash_tag_prefix, $end_pos );

    if( false === $start_pos ){ 

        $stop = true; 

    }

    else{

        if( $start_pos !== 0 && ! in_array( $txt[$start_pos-1], $hash_tag_preffixes ) ){

            // not a tag
            // we use this start position
            // as offset position for the next run
            $end_pos = $start_pos+1;


        }

        else{

            // should be a tag

            $end_pos = strpos( $txt, $hash_tag_suffix, $start_pos );

            if( false === $end_pos ){ $end_pos = strlen( $txt ); }

            $tag_length      = $end_pos-$start_pos;
            $tag_length_true = $tag_length-2; 

            if( $tag_length_true < self::_MIN_LENGTH_HASHTAGS_ ){

                // tag is too short
                $hash_tags[] = 'Tag was too short ';

            }

            elseif( $tag_length_true > self::_MAX_LENGTH_HASHTAGS_ ){

                // tag is too long
                $hash_tags[] = 'Tag was too long ';

            }

            else{

                $hash_tags[] = rtrim( substr( $txt, $start_pos, $tag_length ), ', ' );

            }

        }  

    }

}
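
As a possible direction for tightening the loop above, here is a sketch with two small changes: the prefix check uses isset() on an array keyed by character instead of in_array(), and the end of the tag is found with strcspn() instead of a second strpos() plus a false check. Whether this is actually faster has not been measured here. extract_hashtags_scan() and its parameters are hypothetical names, and the behaviour differs from the original in two ways: tags that are too short or too long are skipped instead of being recorded as messages, and the length check is applied to the trimmed tag including '#'.

```php
// Hypothetical helper, not the original method.
function extract_hashtags_scan(string $txt, int $min_len = 3, int $max_len = 30): array
{
    // Characters allowed directly before '#', stored as array keys so that
    // isset() can replace the in_array() call.
    $prefixes = [' ' => true, ',' => true, '!' => true, '.' => true,
                 '?' => true, '-' => true, '_' => true];

    $tags   = [];
    $len    = strlen($txt);
    $offset = 0;

    while ($offset < $len && false !== ($start = strpos($txt, '#', $offset))) {
        if ($start !== 0 && !isset($prefixes[$txt[$start - 1]])) {
            // '#' inside a word or URL (e.g. textline#notatag): skip it.
            $offset = $start + 1;
            continue;
        }

        // Length of the run from '#' up to the next space or the end of text.
        $tag_len = strcspn($txt, ' ', $start);
        $tag     = rtrim(substr($txt, $start, $tag_len), ',');

        // Length limits include the leading '#'.
        $real_len = strlen($tag);
        if ($real_len >= $min_len && $real_len <= $max_len) {
            $tags[] = $tag;
        }

        // Continue scanning after the run that was just examined.
        $offset = $start + $tag_len;
    }

    return $tags;
}

$tags = extract_hashtags_scan('#first text line#notatag and #second, #x');
```

In this example '#x' is dropped by the minimum length and 'line#notatag' is ignored because the '#' follows a letter.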

0 answers:

No answers