How can I strip all anchor tags but keep only the href value?

Asked: 2014-03-13 09:24:03

Tags: php html regex anchor

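I have a scenario where I need to remove all anchors from some HTML content, but I do not want to lose the href value of each anchor tag when I do so.

At the moment I am using these regular expressions with preg_replace() to strip the anchors.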

<a [^>]*> - strips all the anchor tags
<a.+href\=[\"|\'](.+)[\"|\'].*\>.*\<\/a\> - matches href

Example string: "<a href="mailto:xyz@gmail.com">namemail</a>"

After running preg_replace(), I should get "mailto:xyz@gmail.com" as plain text; everything else should be removed.

3 Answers:

Answer 0: (score: 1)

Try this regex:

~<a.+?href=(["'])(.+?)\1.*?>.*?</a>~is

Description

[Image: regular expression visualization]

Detailed explanation

~<a.+?href=(["'])(.+?)\1.*?>.*?</a>~is

<a    # matches the characters <a literally (case insensitive, because of the i modifier)
.+?   # matches one or more characters, as few as possible
href= # matches the characters href= literally (case insensitive)
1st Capturing group (["'])
    ["'] # matches a single character, either " or '
2nd Capturing group (.+?)
    .+?  # matches one or more characters, as few as possible
\1    # matches the same quote character that the first capturing group matched
.*?   # matches zero or more characters, as few as possible
>     # matches the character > literally
.*?   # matches zero or more characters, as few as possible
</a>  # matches the characters </a> literally (case insensitive)
i modifier: ignore case
s modifier: single line. Dot matches newline characters

NOTE: The ~ characters delimit the regex, so / does not need to be escaped.

Demo

http://regex101.com/r/fX1fP1
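
The answer gives only the pattern, so here is a minimal sketch of how it might be plugged into preg_replace(). The sample input and the variable names are assumptions made for illustration; the replacement uses $2 because group 1 holds the quote character and group 2 holds the href value.

```php
<?php
// Pattern from the answer above. Single quotes keep \1 intact for PCRE.
$pattern = '~<a.+?href=(["\'])(.+?)\1.*?>.*?</a>~is';

// Hypothetical sample input, modeled on the example in the question.
$html = 'Contact: <a href="mailto:xyz@gmail.com">namemail</a> today';

// Replace each whole anchor element with its href value (group 2).
$result = preg_replace($pattern, '$2', $html);

echo $result, PHP_EOL; // prints: Contact: mailto:xyz@gmail.com today
```

Be aware that <a.+? is loose: it can also start matching at other tags whose names begin with "a", such as <abbr>, so this is best used only on input you control.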

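Some notes

  • [\"|\']

    Don't over-escape. Escape a metacharacter only when you want to match it literally. Use ["|'] instead.

  • ["|']

    Don't use | inside a character class unless you want to match a literal |. The characters in a character class are already ORed together. Compare the following:

    When you type ["|'], the regex sees: [image: regular expression visualization]

    When you type ["'], the regex sees: [image: regular expression visualization]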
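
The point about | in a character class can be checked directly. The following is a small sketch with made-up test strings; the variable names are only illustrative.

```php
<?php
// Inside a character class, | is a literal pipe, not alternation.
$withPipe    = '/["|\']/'; // matches ", | or '
$withoutPipe = '/["\']/';  // matches " or ' only

$pipeHit   = preg_match($withPipe, 'a|b');    // 1: the pipe itself matches
$pipeMiss  = preg_match($withoutPipe, 'a|b'); // 0: there is no quote in the subject
$quoteHit  = preg_match($withoutPipe, 'a"b'); // 1: the double quote matches

var_dump($pipeHit, $pipeMiss, $quoteHit);
```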

Answer 1: (score: 1)

$html = '<a href="http://www..." x=asdasda?></a>';
$html = preg_replace("|<a[^>]*href\s*=\s*([\"'])([^\"']*)\\1[^>]*>[^<]*</a>|si", "$2", $html);

Output:

http://www...

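Answer 2: (score: 1)

You will have much more success parsing the HTML with DOMDocument than trying to do this with a regex.

Here is a proof of concept of what could be done: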

function replaceAnchorTags($html) {
    // Initialise the document using the provided HTML
    $doc = new DOMDocument();
    @$doc->loadHTML($html);         //suppress invalid HTML warnings
    $doc_elem = $doc->documentElement;

    traverse($doc, $doc_elem);
    return $doc->saveHTML();
}

function traverse(&$doc, $elem) {
    if ($elem->nodeType === XML_ELEMENT_NODE and $elem->tagName == "a") {
        $href = $elem->getAttribute("href");
        // Obviously here you might want to keep the anchor's inner HTML as
        // well as the URL...
        $text_replacement = $doc->createTextNode($href);
        $elem->parentNode->replaceChild($text_replacement, $elem);
    }

    if ($elem->hasChildNodes()) {
        $children = $elem->childNodes;
        for ($i=0, $max=$children->length; $i<$max; $i++) {
            traverse($doc, $children->item($i));
        }
    }
}

$html = "<p>Hello <a href='http://twitter.com'>Brave New</a> World</p>";

echo replaceAnchorTags($html);
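
The comment in traverse() mentions that you might want to keep the anchor's inner content as well as the URL. Below is one possible sketch of that variation, not part of the original answer: the function name anchorsToText and the "text (url)" output format are assumptions, and the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags assume a PHP/libxml build recent enough to support them. It keeps only the anchor's text content, not nested markup.

```php
<?php
// Hypothetical helper: replace every <a> with "link text (href)".
function anchorsToText($html) {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input has a single root element, so the
    // NOIMPLIED flag can skip the automatic <html>/<body> wrappers.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so snapshot it
    // before modifying the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));

    foreach ($anchors as $a) {
        $label = $a->textContent . ' (' . $a->getAttribute('href') . ')';
        $a->parentNode->replaceChild($doc->createTextNode($label), $a);
    }

    return $doc->saveHTML();
}

$out = anchorsToText("<p>Hello <a href='http://twitter.com'>Brave New</a> World</p>");
echo $out;
```

Snapshotting the node list first avoids skipping elements while the tree is being modified, which is the main thing to watch for with live DOM collections.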