How do I replace plain-text URLs while excluding URLs inside HTML tags?

Asked: 2010-10-23 07:57:51

Tags: php html regex url

I need your help.

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext

into this:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext

I managed it with this regular expression:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);

The problem is that it also replaces img URLs, for example:

sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext

becomes:

sometext sometext <img src="<a href="http//domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext

Please help.

7 answers:

Answer 0: (score: 7)

A simplified version of Gumbo's answer:

$html = <<< HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let's use an XPath expression that fetches only text nodes containing http://, https://, or ftp:// that are not themselves text nodes of an anchor element.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above gives us a text node with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like 

Since PHP 5.3 we could also use PHP inside the XPath and select our nodes with the regex pattern instead of the three calls to contains.
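A rough sketch of that idea, assuming preg_match is exposed to the query through DOMXPath::registerPhpFunctions and called via php:functionString. The sample markup and the shortened scheme pattern are assumptions made for illustration, not part of the answer above.

```php
<?php
// Sample markup (an assumption for this sketch).
$html = '<html><body><p>See http://example.com and '
      . '<a href="http://example.com/2">http://example.com/2</a> '
      . '<img src="http://example.com/foo"/></p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
// The php: prefix must be bound to this namespace before PHP functions can be called.
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Only allow preg_match to be called from inside XPath expressions.
$xPath->registerPhpFunctions('preg_match');

// One regex test replaces the three contains() calls.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) > 0]'
);

foreach ($texts as $text) {
    echo trim($text->nodeValue), PHP_EOL;
}
```

Restricting registerPhpFunctions to a single function name keeps the XPath expression from calling arbitrary PHP functions.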

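Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. Non-standard in this case only means that the method we use for this, DOMDocumentFragment::appendXML, is not part of the W3C specification of the DOM API.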

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

This then outputs:

<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>

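Answer 1: (score: 4)

You shouldn't do this with regular expressions, at least not with regular expressions alone. Use a proper HTML DOM parser instead, such as PHP's DOM library. Then you can iterate over the nodes, check whether each one is a text node, run the regular expression search, and replace the text node appropriately.

Something like this should do it: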

$pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
$doc = new DOMDocument();
$doc->loadHTML($str);
// for every element in the document
foreach ($doc->getElementsByTagName('*') as $elem) {
    // for every child node in each element
    foreach ($elem->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            // split the text content to get an array of 1+2*n elements for n URLs in it
            $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
            $n = count($parts);
            if ($n > 1) {
                $parentNode = $node->parentNode;
                // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                for ($i=1; $i<$n; $i+=2) {
                    $a = $doc->createElement('a');
                    $a->setAttribute('href', $parts[$i]);
                    $a->setAttribute('target', '_blank');
                    $a->appendChild($doc->createTextNode($parts[$i]));
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    $parentNode->insertBefore($a, $node);
                }
                // insert the last part before the original DOMText node
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                // remove the original DOMText node
                $node->parentNode->removeChild($node);
            }
        }
    }
}

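OK, since the DOMNodeLists returned by getElementsByTagName and childNodes are live, every change in the DOM is reflected in those lists, so you cannot use foreach, as it would also iterate over the newly added nodes. Instead, you need to use for loops, keep track of the elements added so the index pointers can be increased, and ideally adjust the pre-calculated array boundaries accordingly.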

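But since that is quite difficult in a fairly complex algorithm like this one (you would need an index pointer and an array boundary for each of the three for loops), a recursive algorithm is more convenient: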

function mapOntoTextNodes(DOMNode $node, $callback) {
    if ($node->nodeType === XML_TEXT_NODE) {
        return $callback($node);
    }
    for ($i=0, $n=count($node->childNodes); $i<$n; ++$i) {
        $nodesChanged = 0;
        switch ($node->childNodes->item($i)->nodeType) {
            case XML_ELEMENT_NODE:
                $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                break;
            case XML_TEXT_NODE:
                $nodesChanged = $callback($node->childNodes->item($i));
                break;
        }
        if ($nodesChanged !== 0) {
            $n += $nodesChanged;
            $i += $nodesChanged;
        }
    }
}
function foo(DOMText $node) {
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($parts);
    if ($n > 1) {
        $parentNode = $node->parentNode;
        $doc = $node->ownerDocument;
        for ($i=1; $i<$n; $i+=2) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $parts[$i]);
            $a->setAttribute('target', '_blank');
            $a->appendChild($doc->createTextNode($parts[$i]));
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->insertBefore($a, $node);
        }
        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
        $parentNode->removeChild($node);
    }
    return $n-1;
}

$str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http//domain.com/image.jpg"> sometext sometext</div>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elems = $doc->getElementsByTagName('body');
mapOntoTextNodes($elems->item(0), 'foo');

Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).

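The function foo is then used to find and replace the plain URLs in a DOMText node's content. It splits the content string into non-URL and URL parts using preg_split while capturing the delimiter, which yields an array of 1 + 2·n items for n URLs. The non-URL parts are then replaced by new DOMText nodes and the URL parts by new A elements; these are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call it on one specific DOMNode.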
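A different way around the live-list problem, sketched here as an alternative rather than as this answer's method, is to take a static snapshot of the text nodes with iterator_to_array before changing the DOM. A plain foreach is then safe because the array never sees the newly inserted nodes. The sample markup and the simplified URL pattern (which does not trim trailing punctuation) are assumptions for illustration.

```php
<?php
// Sample input (an assumption for this sketch).
$doc = new DOMDocument();
$doc->loadHTML('<div>see http://example.com/a.html now <img src="http://example.com/i.jpg"> end</div>');

$xpath = new DOMXPath($doc);
// Copy the matching text nodes into a plain array so later DOM changes cannot affect the loop.
$textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

// Simplified pattern: scheme followed by anything that is not whitespace, <, >, or a quote.
$pattern = '~((?:https?|ftp)://[^\s<>"\']+)~i';

foreach ($textNodes as $node) {
    // Odd indexes hold the captured URLs, even indexes the text around them.
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no URL in this text node
    }
    $parent = $node->parentNode;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($doc->createTextNode($part));
            $parent->insertBefore($a, $node);
        } elseif ($part !== '') {
            $parent->insertBefore($doc->createTextNode($part), $node);
        }
    }
    // The original text node has been fully replaced by the pieces inserted before it.
    $parent->removeChild($node);
}

$out = $doc->saveHTML($doc->getElementsByTagName('div')->item(0));
echo $out, PHP_EOL;
```

The snapshot costs one extra array, but it removes the need to adjust index pointers and loop boundaries by hand.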

Answer 2: (score: 1)

Thanks for the replies, but it still doesn't work. I fixed it using this function:

function livelinked ($text){
        // initialise both lists so str_replace() also works when no URL is found
        $old = array();
        $new = array();
        preg_match_all("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)|^(jpg)#i", $text, $ccs);
        foreach ($ccs[3] as $cc) {
           // skip image URLs; strpos() returns false (not 0) when the needle is absent
           if (strpos($cc,"jpg")===false && strpos($cc,"gif")===false && strpos($cc,"png")===false) {
              $old[] = "http://".$cc;
              $new[] = '<a href="http://'.$cc.'" target="_blank">'.$cc.'</a>';
           }
        }
        return str_replace($old,$new,$text);
}

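Answer 3: (score: 0)

If you want to keep using a regex (and in this case a regex is quite appropriate), you can make it match only URLs that "stand alone". Using the word boundary escape sequence (\b), you can have the regex match only where http is immediately preceded by whitespace or the beginning of the text: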

preg_replace("#\b((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);
            // ^^ thar she blows

Thus, "http://..." won't match, but http:// as its own word will.

Answer 4: (score: 0)

DomDocument is more mature and runs much faster, so this is just an alternative for anyone who wants to use PHP Simple HTML DOM Parser:

<?php
require_once('simple_html_dom.php');

$html = str_get_html('sometext sometext http://www.somedomain.com/index.html sometext sometext
<a href="http://www.somedomain.com/index.html">http://www.somedomain.com/index.html</a>
sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

foreach ($html->find('text') as $element)
{
    // you can add any tag into the array to exclude from replace
    if (!in_array($element->parent()->tag, array('a')))
        $element->innertext = preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $element->innertext);
}

echo $html;

Answer 5: (score: 0)

You can try the code from this question:

echo preg_replace('/<a href="([^"]*)([^<\/]*)<\/a>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

If you want to convert some other tag, that's easy enough:

echo preg_replace('/<img src="([^"]*)([^\/><]*)>/i', "$1", 'sometext sometext <img src="http//domain.com/image.jpg"> sometext sometext');

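Answer 6: (score: 0)

Match a whitespace character (\s) at the beginning and end of the URL string. This ensures that

"http://url.com"

is not matched, while

http://url.com

is matched.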
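A minimal sketch of this idea, using a lookbehind instead of consuming the whitespace: (?<!\S) only allows a match at the start of the text or directly after whitespace, so a URL preceded by a quote character (as in src="...") is skipped. Note that \b would not be enough here, because a word boundary also exists between a quote and the h of http. The function name linkStandaloneUrls and the simplified \S+ URL body are assumptions for illustration; the pattern also avoids the /e modifier, which was removed in PHP 7.0.

```php
<?php
// Hypothetical helper: link only URLs that start the text or follow whitespace.
function linkStandaloneUrls(string $text): string
{
    return preg_replace(
        // (?<!\S) = not preceded by a non-whitespace character;
        // \S+ then runs up to the next whitespace or the end of the text.
        '~(?<!\S)((?:https?|ftp)://\S+)~i',
        '<a href="$1" target="_blank">$1</a>',
        $text
    );
}

$in = 'sometext http://www.somedomain.com/index.html sometext '
    . '<img src="http://domain.com/image.jpg"> sometext';
$out = linkStandaloneUrls($in);
echo $out, PHP_EOL;
```

This stays a heuristic: a URL directly after a tag (for example <p>http://...) is not linked, and trailing punctuation such as a comma becomes part of the link, so the DOM-based answers above remain the more robust route.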