PHP RegEx: match a word or sentence in HTML inside <p> but not inside <div>, <img/> or <a> tags

Time: 2016-11-25 20:42:07

Tags: php html regex

I'm trying to match and replace about 100 words inside an html document creating links for each word. For performance reasons, I think DOM manipulation will be slower than preg_replace.

The thing is, I want to match (and replace) only simple words (or sentences) INSIDE <p> tags, BUT NOT inside any other tag: <a>, <div> or <img>.

I'm using this regex expression to match the word "sapien":

/(<p[^>]*>)(.*)(?!<a\s[^>]+>[^<\/a>]+)(?!=\"[\w]*)(\bsapien\b)(?![^<\/a>]+<\/a>)(?![^\w]*\")(.*)(<\/p>)/imU

Here is the text where I'm applying it:

<p>Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.</p>

I'm getting the match in

<a href="#">sapien</a> 

which is inside an <a> tag.

Any help will be much appreciated. Thanks.

2 answers:

Answer 0 (score: 2)

A one-step solution using a negative lookahead:

preg_replace("#\b(sapien)\b(?![^<>]*(<\/a|<\/div|>))#i", "<a href='#'>\\1</a>", $input);

Demo: http://ideone.com/Z74X0f

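For variable-width lookaround patterns only a lookahead can be used (a PCRE lookbehind must have a fixed length, so it will not work this way). The pattern therefore checks whether a closing tag follows the word.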
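As a minimal sketch of how the lookahead behaves, the snippet below runs the same pattern on a shortened, hypothetical version of the question's text. The variable names are illustrative and not part of the answer.

```php
<?php
// Same pattern as in the answer: match "sapien" unless the text that follows,
// up to the next angle bracket, runs into "</a", "</div" or ">".
$pattern = '#\b(sapien)\b(?![^<>]*(<\/a|<\/div|>))#i';

// Shortened sample (assumption: one plain word, one linked word, one alt attribute).
$html = '<p>Sed erat sapien, <a href="#">sapien</a> ac <img src="myimage.jpg" alt="sapien" > dolor.</p>';

$result = preg_replace($pattern, "<a href='#'>\\1</a>", $html);
echo $result, "\n";
```

Only the first occurrence is linked: the second is followed by `</a`, and the one in the `alt` attribute is followed by `>` before any `<`.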

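The current regexp works well on the sample text, but nested tags can cause problems. For example, if another tag appears before the closing tag, as in <div> sapien <img></div>, the replacement is still applied to that part.

You can avoid this by adding an extra alternative to the regexp: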

\b(sapien)\b(?!([^<>]*(<img[^>]+>)[^<>]*|[^<>]*)(<\/a|<\/div|>))

Demo: https://regex101.com/r/a5JiOo/2
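The difference between the two patterns can be sketched on a small, hypothetical input. Note that `<img[^>]+>` needs at least one character after `<img`, so this sketch uses an `<img>` with a `src` attribute; a bare `<img>` would not be covered by the extra alternative.

```php
<?php
$basic    = '#\b(sapien)\b(?![^<>]*(<\/a|<\/div|>))#i';
$extended = '#\b(sapien)\b(?!([^<>]*(<img[^>]+>)[^<>]*|[^<>]*)(<\/a|<\/div|>))#i';

// Hypothetical input: the word is followed by another tag before </div>.
$nested = '<div>sapien <img src="x.jpg"></div>';

$basicResult    = preg_replace($basic, "<a href='#'>\\1</a>", $nested);
$extendedResult = preg_replace($extended, "<a href='#'>\\1</a>", $nested);

var_dump($basicResult !== $nested);    // the basic pattern links the word inside <div>
var_dump($extendedResult === $nested); // the extended pattern leaves the input unchanged
```

Each additional tag that may sit between the word and the closing tag would need its own alternative, so this approach grows quickly with nesting.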

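Answer 1 (score: 1)

It is much easier to split the logic: first find all the parts that are not affected by the <a>, <div> and <img> tags, then replace the words/sentences within them.

I wrote a PHP function parse_text() that scans the text sequentially and, each time it has parsed a new clean piece of text, calls the callback function my_replace() to do the replacement.

There is a working demo on ideone.com, and the full listing is below. I hope this solution helps.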

<?php
$input = <<<EOD
<p>sapien Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.sapien</p>
EOD;

// define tags which you need to exclude from replacement 
// as: array( start_string => end_string, ... );
$ignore_tags = array(
    '<a' => '</a>',
    '<img' => '>',
    '<div' => '</div>'
);

echo "Input:\n {$input} \n\n ";
$output = parse_text($input, $ignore_tags);
echo "Output:\n {$output}";

// callback function invoked every time 'parse_text' has parsed a 'clean' piece of text
function my_replace($text) {
    echo "my_replace call on: \n".$text."\n\n";

    // your replacements here
    $text = preg_replace("#\b(sapien)\b#i", "<a href=#>\\1</a>", $text);
    return $text;
}


// main parsing function that splits the text into clean and ignored parts
function parse_text($input, $ignore_tags) {
    $output = '';
    $str = '';
    $ignore = false;
    $current_ignore_tag = '';
    $ignore_tags_regexp = implode("|", array_keys($ignore_tags));

    for ($i = 0; $i < strlen($input); $i++) {
        $str .= $input[$i];
        // if an ignore tag starts here and we are not already ignoring
        // (note: '<a' also matches the start of longer tag names such as '<abbr')
        if (preg_match("#({$ignore_tags_regexp})$#si", $str, $m) && !$ignore) {
            $str = preg_replace("#({$ignore_tags_regexp})$#si", "", $str); // cut and not include ignore tag
            $output .= my_replace($str) . $m[1]; // replace all before and save
            $ignore = true;
            $current_ignore_tag = strtolower($m[1]); // keys of $ignore_tags are lowercase, the match is case-insensitive
            $str = '';
        } // if $ignore and matches the end of the current ignore tag
        elseif ($ignore && preg_match("#({$ignore_tags[$current_ignore_tag]})$#i", $str, $m)) {
            $output .= $str; // save the ignored piece as it is
            $ignore = false;
            $str = '';
        }
    }
    $output .= (!$ignore) ? my_replace($str) : $str;
    return $output;
}

Result:

Input:
 <p>sapien Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.sapien</p> 

 my_replace call on: 
<p>sapien Cras cursus consequat nibh 

my_replace call on: 
ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt 

my_replace call on: 
rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis 

my_replace call on: 
 dolor malesuada molestie.sapien</p>

Output:
 <p><a href=#>sapien</a> Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat <a href=#>sapien</a>, condimentum quis risus nec, viverra dignissim nisi. Cras <a href=#>sapien</a> convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, <a href=#>sapien</a>.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.<a href=#>sapien</a></p>
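The same split-then-replace idea can also be sketched with preg_split() instead of a character-by-character scan. This is only an illustration under stated assumptions, not the answer's implementation: link_outside_tags() is a hypothetical helper, and it assumes that <a> and <div> elements are not nested inside elements of the same name.

```php
<?php
// Hypothetical helper: link $word only in text outside <a>, <div> and <img>.
function link_outside_tags(string $html, string $word): string
{
    // Segments to leave untouched: whole <a>...</a> and <div>...</div>
    // elements, and single <img ...> tags.
    $skip = '#(<a\b.*?</a>|<div\b.*?</div>|<img\b[^>]*>)#is';

    // With PREG_SPLIT_DELIM_CAPTURE the skipped segments are kept in the
    // result and land at the odd indexes; even indexes hold the clean text.
    $parts = preg_split($skip, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $pattern = '#\b(' . preg_quote($word, '#') . ')\b#i';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, '<a href="#">$1</a>', $part);
        }
    }
    return implode('', $parts);
}

$sample = '<p>sapien <a href="sapien">sapien</a> x <img alt="sapien" > sapien</p>';
$linked = link_outside_tags($sample, 'sapien');
echo $linked, "\n";
```

Like parse_text(), this sketch still replaces the word inside attributes of tags that are not in the ignore list (for example a class attribute on <p>), so it is only suitable when that cannot occur in the input.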