I'm trying to match and replace about 100 words inside an HTML document, creating a link for each word. I'd rather avoid DOM manipulation, because I expect it to be slower than preg_replace.
The thing is, I want to match (and replace) only plain words (or sentences) INSIDE <p> tags, BUT NOT inside any other tag such as <a>, <div>, or <img>.
I'm using this regular expression to match the word "sapien":
/(<p[^>]*>)(.*)(?!<a\s[^>]+>[^<\/a>]+)(?!=\"[\w]*)(\bsapien\b)(?![^<\/a>]+<\/a>)(?![^\w]*\")(.*)(<\/p>)/imU
Here is the text where I'm applying it:
<p>Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.</p>
I'm getting a match in
<a href="#">sapien</a>
which is inside an <a> tag.
Any help will be much appreciated. Thanks.
Answer 0 (score: 2)
The solution takes a single step, using a negative lookahead:
preg_replace("#\b(sapien)\b(?![^<>]*(<\/a|<\/div|>))#i", "<a href='#'>\\1</a>", $input);
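For lookaround patterns that are not fixed-width we can only use a lookahead (a lookbehind does not work this way), so we check whether a closing tag follows the matched string.
The current regexp works well on the sample text, but nested tags may cause problems. For example, if any other tag comes before the closing tag, as in <div> sapien <img></div>, the replacement is applied to that part as well.
You can avoid this by adding an extra alternative to the regexp: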
\b(sapien)\b(?!([^<>]*(<img[^>]+>)[^<>]*|[^<>]*)(<\/a|<\/div|>))
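To see the difference between the two patterns, here is a minimal sketch that runs both on a small sample. The sample string and the variable names are assumptions made for illustration, not part of the question. Note that the <img> tag in the sample carries an attribute, because the `<img[^>]+>` part of the extended pattern needs at least one character after `<img`.

```php
<?php
// Pattern from the answer: reject a match that is followed by </a, </div or a bare ">".
$simple = '#\b(sapien)\b(?![^<>]*(<\/a|<\/div|>))#i';
// Extended pattern: additionally allow one <img ...> between the word and the closing tag.
$extended = '#\b(sapien)\b(?!([^<>]*(<img[^>]+>)[^<>]*|[^<>]*)(<\/a|<\/div|>))#i';

// Hypothetical sample: one word in a <p>, one inside a <div> followed by an <img>.
$sample = '<p>sapien</p><div> sapien <img src="x.png"></div>';

$link = "<a href='#'>\\1</a>";
$simpleResult = preg_replace($simple, $link, $sample);
$extendedResult = preg_replace($extended, $link, $sample);

echo $simpleResult, "\n";   // both occurrences are linked, including the one inside <div>
echo $extendedResult, "\n"; // only the occurrence inside <p> is linked
```

The lookahead only inspects the text up to the next tag (plus one optional <img>), so deeper nesting would still need more alternatives or a different approach.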
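Answer 1 (score: 1)
It is much easier to split the logic: first find all the parts that are not affected by the <a>, <div>, or <img> tags, then replace the words/sentences inside them.
I wrote the PHP function parse_text(), which scans the text sequentially and, each time it has parsed a new clean piece of text, calls the callback function my_replace() to do the replacement.
There is a working demo at ideone.com, and the full listing follows. I hope this solution helps.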
<?php
$input = <<<EOD
<p>sapien Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.sapien</p>
EOD;
// define the tags you need to exclude from replacement
// as: array( start_string => end_string, ... );
$ignore_tags = array(
    '<a' => '</a>',
    '<img' => '>',
    '<div' => '</div>'
);
echo "Input:\n {$input} \n\n ";
$output = parse_text($input, $ignore_tags);
echo "Output:\n {$output}";

// callback function invoked every time 'parse_text' parses a 'clean' piece of text
function my_replace($text) {
    echo "my_replace call on: \n".$text."\n\n";
    // your replacements here
    $text = preg_replace("#\b(sapien)\b#i", "<a href=#>\\1</a>", $text);
    return $text;
}

// main parsing function that splits the text into clean and ignored parts
function parse_text($input, $ignore_tags) {
    $output = '';
    $str = '';
    $ignore = false;
    $current_ignore_tag = '';
    $ignore_tags_regexp = implode("|", array_keys($ignore_tags));
    for ($i = 0; $i < strlen($input); $i++) {
        $str .= $input[$i];
        // if an ignore tag starts and we are not already ignoring
        if (preg_match("#({$ignore_tags_regexp})$#si", $str, $m) && !$ignore) {
            $str = preg_replace("#({$ignore_tags_regexp})$#si", "", $str); // cut off the ignore tag
            $output .= my_replace($str) . $m[1]; // replace everything before it and save
            $ignore = true;
            // the match is case-insensitive, so normalize the array key (e.g. '<A' becomes '<a')
            $current_ignore_tag = strtolower($m[1]);
            $str = '';
        } // if ignoring and the end of the current ignore tag is reached
        elseif ($ignore && preg_match("#({$ignore_tags[$current_ignore_tag]})$#i", $str, $m)) {
            $output .= $str; // save the current piece as it is
            $ignore = false;
            $str = '';
        }
    }
    $output .= (!$ignore) ? my_replace($str) : $str;
    return $output;
}
Result:
Input:
<p>sapien Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.sapien</p>
my_replace call on:
<p>sapien Cras cursus consequat nibh
my_replace call on:
ac vehicula. Sed erat sapien, condimentum quis risus nec, viverra dignissim nisi. Cras sapien convallis, erat egestas tincidunt
my_replace call on:
rutrum, massa enim sagittis ante, sapien.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis
my_replace call on:
dolor malesuada molestie.sapien</p>
Output:
<p><a href=#>sapien</a> Cras cursus consequat nibh <a href="#">sapien</a>ac vehicula. Sed erat <a href=#>sapien</a>, condimentum quis risus nec, viverra dignissim nisi. Cras <a href=#>sapien</a> convallis, erat egestas tincidunt <img src="myimage.jpg" alt="sapien" >rutrum, massa enim sagittis ante, <a href=#>sapien</a>.sed pellentesque lorem risus vitae enim. Curabitur hendrerit dolor facilisis <a href="sapien">sapien</a> dolor malesuada molestie.<a href=#>sapien</a></p>
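The same split-then-replace idea can also be sketched more compactly with preg_split() and the PREG_SPLIT_DELIM_CAPTURE flag. This is a swapped-in technique, not the parse_text() listing above: the helper name replace_outside_tags() and the shortened sample are assumptions for illustration. Unlike parse_text(), it skips the inside of every tag (so attributes of <p> are protected too), and it assumes that <div> elements are not nested and that attribute values contain no ">".

```php
<?php
// Hypothetical helper: apply $callback only to the text between protected pieces.
function replace_outside_tags($html, $callback) {
    // The single capturing group makes preg_split return the delimiters as well,
    // so pieces at even indexes are plain text and pieces at odd indexes are
    // whole <a>...</a> / <div>...</div> elements or single tags.
    $parts = preg_split(
        '#(<a\b[^>]*>.*?</a>|<div\b[^>]*>.*?</div>|<[^>]+>)#is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = $callback($part);
        }
    }
    return implode('', $parts);
}

// Shortened sample based on the text in the question.
$sample = '<p>Sed erat sapien, <a href="#">sapien</a> ac <img src="myimage.jpg" alt="sapien" > nisi.</p>';
$result = replace_outside_tags($sample, function ($text) {
    return preg_replace('#\b(sapien)\b#i', '<a href=#>\\1</a>', $text);
});
echo $result, "\n";
```

Because the input is split in a single pass instead of re-running preg_match on a growing buffer for every character, this variant should scale better on long documents, though that has not been measured here.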