Question

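I have a scenario where I need to remove all anchors from HTML content, but while doing so I don't want to strip the href part of the anchor tag.

Currently I am using these regular expressions with preg_replace() to strip the anchors.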

<a [^>]*> - strips all the anchor tags
<a.+href\=[\"|\'](.+)[\"|\'].*\>.*\<\/a\> - matches href

Sample string: "anchor href="mailto:xyz@gmail.com"> namemail anchor"

After preg_replace() runs, I should get the string "mailto:xyz@gmail.com" as text; everything else should be removed.

Answer 1

Try this regular expression:

~<a.+?href=(["'])(.+?)\1.*?>.*?</a>~is

Detailed explanation

~<a.+?href=(["'])(.+?)\1.*?>.*?</a>~is

<a    # matches the characters <a literally (case insensitive, because of the i modifier)
.+?   # matches one or more of any character, as few as possible (lazy)
href= # matches the characters href= literally (case insensitive)
1st capturing group (["'])
    ["'] # matches a single character: either " or '
2nd capturing group (.+?)
    .+?  # matches one or more of any character, as few as possible (lazy); this is the href value
\1    # backreference: matches the same quote character that the 1st capturing group matched
.*?   # matches zero or more of any character, as few as possible (lazy)
>     # matches the character > literally
.*?   # matches zero or more of any character, as few as possible (lazy)
</a>  # matches the characters </a> literally (case insensitive)
i modifier: case-insensitive matching
s modifier: single line; the dot also matches newline characters

NOTE: The ~ characters around the regex are its delimiters. Using ~ instead of / means the / in </a> does not need to be escaped.
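The answer gives only the pattern, so here is a minimal sketch of how it might be plugged into preg_replace(). The function name stripAnchorsKeepHref and the sample input are assumptions made for illustration; the replacement '$2' refers to the second capturing group, which holds the href value.

```php
<?php
// Hypothetical helper (not part of the original answer): replaces every
// anchor element with the value of its href attribute.
function stripAnchorsKeepHref(string $html): string
{
    // Single-quoted PHP string: \' yields a literal ', and \1 is passed
    // through to PCRE as the backreference to the opening quote.
    $pattern = '~<a.+?href=(["\'])(.+?)\1.*?>.*?</a>~is';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '$2', $html) ?? $html;
}

$sample = '<a href="mailto:xyz@gmail.com">namemail</a>';
echo stripAnchorsKeepHref($sample), "\n"; // prints mailto:xyz@gmail.com
```

Text outside the anchor is left untouched, so only the anchor itself is reduced to its URL.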

Demo

http://regex101.com/r/fX1fP1

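Some notes

[\"|\']

Don't over-escape. Escape a metacharacter only when you want to match it literally. Use ["|'] instead.

["|']

Don't use | inside a character class unless you actually want to match it. The characters in a character class are already ORed together. Compare the two forms:

When you type ["|'], the regex sees a class that matches any one of three characters: ", | or '.

When you type ["'], the regex sees a class that matches any one of two characters: " or '.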
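The difference between the two character classes can be checked with a short sketch. The test string and variable names below are made up for illustration; the string contains one double quote, one pipe and one single quote.

```php
<?php
// Test string: a"b|c'd
$subject = 'a"b|c\'d';

// ["|'] also matches the pipe, because | is a literal inside a class.
$withPipe = preg_match_all('/["|\']/', $subject, $m1);

// ["'] matches only the two quote characters.
$withoutPipe = preg_match_all('/["\']/', $subject, $m2);

echo $withPipe, ' ', $withoutPipe, "\n"; // prints 3 2
```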

Answer 2

$html = '<a href="http://www..." x=asdasda?></a>';
$html = preg_replace("|<a[^>]*href\s*=\s*([\"'])([^\"']*)\\1[^>]*>[^<]*</a>|si", "$2", $html);

Output:

http://www...

Answer 3

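You will have much more success parsing the HTML with DOMDocument than trying to do this with regular expressions.

Here is a proof of concept of what that could look like: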

function replaceAnchorTags($html) {
    // Initialise document using provided HTML
    $doc = new DOMDocument();
    @$doc->loadHTML($html);         //suppress invalid HTML warnings
    $doc_elem = $doc->documentElement;

    traverse($doc, $doc_elem);
    return $doc->saveHTML();
}

function traverse(&$doc, $elem) {
    if ($elem->nodeType === XML_ELEMENT_NODE and $elem->tagName == "a") {
        $href = $elem->getAttribute("href");
        // Obviously here you might want to keep the anchor's inner HTML as
        // well as the URL...
        $text_replacement = $doc->createTextNode($href);
        $elem->parentNode->replaceChild($text_replacement, $elem);
    }

    if ($elem->hasChildNodes()) {
        $children = $elem->childNodes;
        for ($i=0, $max=$children->length; $i<$max; $i++) {
            traverse($doc, $children->item($i));
        }
    }
}

$html = "<p>Hello <a href='http://twitter.com'>Brave New</a> World</p>";

echo replaceAnchorTags($html);
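The comment in traverse() mentions that you might want to keep the anchor's inner text as well as the URL. A rough sketch of that variant follows, assuming a "text (href)" output format; the function name anchorsToText and that format are illustrative choices, not part of the answer. It uses getElementsByTagName() instead of a recursive walk.

```php
<?php
// Hypothetical helper: replaces each <a> element with plain text of the
// form "link text (href)", or just the href when the anchor has no text.
function anchorsToText(string $html): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress invalid HTML warnings

    // getElementsByTagName() returns a live list, so copy it into an
    // array before replacing nodes.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));

    foreach ($anchors as $a) {
        $label = trim($a->textContent);
        $href  = $a->getAttribute('href');
        $text  = $label === '' ? $href : $label . ' (' . $href . ')';
        $a->parentNode->replaceChild($doc->createTextNode($text), $a);
    }

    // Like the proof of concept above, this returns a full HTML document.
    return $doc->saveHTML();
}

echo anchorsToText("<p>Hello <a href='http://twitter.com'>Brave New</a> World</p>");
```

For the sample input, the paragraph in the returned document reads "Hello Brave New (http://twitter.com) World".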
