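I have written a basic function to extract hashtags ('#hashtag') from a string. The function is based on a strpos() loop.
I think there is more potential for optimization, and it would be great if you could show how to make it faster.
Compared to an alphanumeric preg_match_all(), it is not much faster. I tested it on a string containing 350000 tags surrounded by some text.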
Results:
350000 Tags extracted in 0.89877200126648 seconds // preg_match_all
350000 Tags extracted in 0.61978793144226 seconds // strpos loop
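For anyone who wants to reproduce timings like these, here is a minimal sketch of a timing harness. The sample text and the `$count` value are assumptions (the original test data is not shown), and the count is kept small so the script runs quickly with default memory limits.

```php
<?php
// Assumed sample data: the original 350000-tag test string is not shown in the post.
$count = 10000;
$txt = str_repeat('some text #hashtag ', $count);

// Time a single preg_match_all() run with the pattern from the post.
$start = microtime(true);
$found = preg_match_all('/(^|\s)\#\w+/', $txt, $r, PREG_OFFSET_CAPTURE);
$elapsed = microtime(true) - $start;

printf("%d Tags extracted in %.4f seconds // preg_match_all\n", $found, $elapsed);
```

Running each variant several times and comparing the fastest runs tends to give steadier numbers than a single measurement.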
Functions used:
preg_match_all():
$pattern = '/(^|\s)\#\w+/';
preg_match_all( $pattern, $txt, $r, PREG_OFFSET_CAPTURE );
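One detail of this pattern is that every match except one at the very start of the text includes the whitespace character before the '#'. The sketch below shows this on a made-up sample string, next to an assumed variant that uses a negative lookbehind, `(?<!\S)`, so the match contains only the tag and no capture group is needed. Whether the variant is faster would need to be measured.

```php
<?php
// Made-up sample text for illustration.
$txt = '#first some text #second, url.de/page#anchor and #third';

// Pattern from the post: the leading whitespace is part of each match.
preg_match_all('/(^|\s)\#\w+/', $txt, $r);
$withSpace = $r[0]; // ['#first', ' #second', ' #third']

// Assumed variant: (?<!\S) means "not preceded by a non-whitespace character",
// i.e. start of text or whitespace before '#', without consuming it.
preg_match_all('/(?<!\S)#\w+/', $txt, $m);
$tags = $m[0]; // ['#first', '#second', '#third']

print_r($tags);
```

In both cases '#anchor' is skipped because it directly follows a word character.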
strpos() Loop:
const _MAX_NUM_HASHTAGS_ = 10;
const _MAX_NUM_USERADDS_ = 10;
const _MIN_LENGTH_HASHTAGS_ = 3; // including '#'
const _MAX_LENGTH_HASHTAGS_ = 30; // including '#'
...
$txt = $this->_postText;
$hash_tags = array();
$stop = false;
$hash_tag_prefix = '#';
$hash_tag_suffix = ' ';
$hash_tag_preffixes = array( ' ', ',', '!', '.', '?', '-', '_' );
$i = 0;
$end_pos = 0;
while ( false === $stop ) {
    $i++;
    #if( $i === self::_MAX_NUM_HASHTAGS_ ){ $stop = true; }
    // if the tag is not at the beginning of our text
    // we need to validate that the tag is not part of
    // a normal string like /linksource.de?tag=tag#anchor
    // or textline#notatag
    $start_pos = strpos( $txt, $hash_tag_prefix, $end_pos );
    if( false === $start_pos ){
        $stop = true;
    }
    else{
        if( $start_pos !== 0 && ! in_array( $txt[$start_pos-1], $hash_tag_preffixes ) ){
            // not a tag
            // we use this start position
            // as offset position for the next run
            $end_pos = $start_pos+1;
        }
        else{
            // should be a tag
            $end_pos = strpos( $txt, $hash_tag_suffix, $start_pos );
            if( false === $end_pos ){ $end_pos = strlen( $txt ); }
            $tag_length = $end_pos-$start_pos;
            $tag_length_true = $tag_length-2;
            if( $tag_length_true < self::_MIN_LENGTH_HASHTAGS_ ){
                // tag is too short
                $hash_tags[] = 'Tag was too short ';
            }
            elseif( $tag_length_true > self::_MAX_LENGTH_HASHTAGS_ ){
                // tag is too long
                $hash_tags[] = 'Tag was too long ';
            }
            else{
                $hash_tags[] = rTrim( substr( $txt, $start_pos, $tag_length ), ', ' );
            }
        }
    }
}
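Since the loop above is a fragment of a class, here is a self-contained sketch of the same idea that can be run and timed on its own. The function name `extract_hashtags`, its signature and its defaults are hypothetical. It also differs from the loop above in a few ways, all of which are assumptions rather than the original behavior: it uses `isset()` on an array of prefix characters instead of `in_array()`, it uses `strcspn()` to find the end of the tag, it checks the length of the trimmed tag including '#', and it silently drops tags that are too short or too long instead of storing a placeholder string.

```php
<?php
// Hypothetical standalone helper; name, signature and defaults are illustrative.
function extract_hashtags(string $txt, int $minLen = 3, int $maxLen = 30): array
{
    // Characters allowed directly before '#', stored as keys for isset() lookups.
    $prefixes = [' ' => true, ',' => true, '!' => true, '.' => true,
                 '?' => true, '-' => true, '_' => true];
    $tags = [];
    $len = strlen($txt);
    $pos = 0;

    while ($pos < $len && false !== ($start = strpos($txt, '#', $pos))) {
        if ($start !== 0 && !isset($prefixes[$txt[$start - 1]])) {
            // '#' inside a word or URL: not a tag, continue after it.
            $pos = $start + 1;
            continue;
        }
        // Length of the run up to the next space or the end of the text, including '#'.
        $tagLen = strcspn($txt, ' ', $start);
        $pos = $start + $tagLen;

        // Strip trailing commas and spaces, then apply the length limits.
        $tag = rtrim(substr($txt, $start, $tagLen), ', ');
        $n = strlen($tag);
        if ($n >= $minLen && $n <= $maxLen) {
            $tags[] = $tag;
        }
    }
    return $tags;
}

print_r(extract_hashtags('#first some text #second, url.de/page#anchor and #third'));
```

Whether this is actually faster than the original loop or than preg_match_all() depends on the input and the PHP version, so it is worth timing against the same test string.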