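How to make this function, which strips URLs using a RegEx, exclude a given URL

Time: 2012-08-26 00:35:48

Tags: php regex

I came across this excellent "little" RegEx for replacing URLs in plain text (not hyperlinks). The only problem is that I know very little about RegEx, so I'm completely stuck trying to get it to work properly for my blog.

So I'm asking for help excluding a URL, e.g. $exception_url = 'http://mysite.com'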

function strip_urls($text, $exception_url = FALSE)
{
    return preg_replace("/( (?:
    (?:https?|ftp) : \\/*
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
        | (?: [0-9A-Fa-f:]+ : [0-9A-Fa-f]{1,4} )
    )
    (?: : [0-9]+ )?
    (?! [a-zA-Z0-9.:-] )
    (?:
        \\/
        [^&?#\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]*
    )?
    (?:
        [?#]
        [^\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]+
    )?
) | (?:
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
    )
    (?: : [0-9]+ )?
    (?! [a-zA-Z0-9.:-] )
    (?:
        \\/
        [^&?#\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]*
    )?
    (?:
        [?#]
        [^\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]+
    )?
) | (?:
    [a-zA-Z0-9._-]{2,} @
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
    )
) )/Dx", '', $text);
}

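Answers are much appreciated, thanks.

2 answers:

Answer 0: (score: 2)

Changing the regex itself is almost impossible, and it would end up huge.

However, you could temporarily replace the parts of the exception URL that identify it as a URL with some dummy strings, and then swap them back after the regex has run. If you want to be really paranoid, you can make sure the replacement strings don't already exist in the text (or won't exist after the URLs are stripped) and, if they do, append a random number until they don't: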

$identifier = '.com';
$temp_replace = '@@@STRIP_URLS-COM@@@';
$identifier2 = '://';
$temp_replace2 = '@@@STRIP_URLS-SLASHES@@@';
if ($exception_url) {
    $text = str_replace($exception_url, str_replace(array($identifier, $identifier2), array($temp_replace, $temp_replace2), $exception_url), $text);
}

$text = preg_replace(...)
....rest of regex here...

if ($exception_url) {
    $text = str_replace(array($temp_replace, $temp_replace2), array($identifier, $identifier2), $text);
}
return $text;
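
Put together as one function, the masking idea above might look like the sketch below. The function name strip_urls_except, the $pattern parameter and the demo pattern are assumptions made for illustration; the demo pattern is only a simplified stand-in, not the full regex from the question. Note that this variant swaps the whole exception URL for a single token instead of replacing only its '.com' and '://' parts, and it regenerates the token until it does not occur in the text.

```php
<?php
// Hypothetical helper: mask one exception URL, strip URLs, then unmask.
// $pattern is whatever URL-matching regex you want to apply.
function strip_urls_except($text, $pattern, $exception_url = FALSE)
{
    $token = '';

    if (is_string($exception_url) && $exception_url !== '') {
        // Build a token without dots, slashes or colons and regenerate it
        // until it does not already occur in the text.
        do {
            $token = 'STRIPURLS' . mt_rand() . 'X';
        } while (strpos($text, $token) !== FALSE);

        // Hide every occurrence of the exception URL behind the token.
        $text = str_replace($exception_url, $token, $text);
    }

    // Strip whatever the pattern still recognises as a URL.
    $text = preg_replace($pattern, '', $text);

    // Put the exception URL back.
    if ($token !== '') {
        $text = str_replace($token, $exception_url, $text);
    }

    return $text;
}

// Simplified stand-in pattern (assumption): http, https and ftp URLs only.
$demo_pattern = '~(?:https?|ftp)://[^\s<>"\']+~i';

echo strip_urls_except(
    'Visit http://mysite.com or http://spam.example.com/buy now',
    $demo_pattern,
    'http://mysite.com'
);
```

In this demo the spam URL is removed and http://mysite.com survives; the spaces that surrounded the removed URL are left in place.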

Answer 1: (score: 0)

I'm sure someone will find this useful.

You can specify a base URL, i.e. allow URLs from your own site:

strip_urls($blog_comment, 'http://www.mysite.com/');

Or from a set of partner domains:

strip_urls($blog_comment, array('http://mysite.com/', 'http://partner.com/', 'http://partner1.com/'));

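Using Mihai Loga's placeholder idea, I modified the initial script to accept either an array or a string as $exception_url. I also generate the placeholder to make it safer.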

function strip_urls($text, $exception_url = array())
{
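    if( ! empty($exception_url))
    {
        if(is_string($exception_url)) $exception_url = array($exception_url);

        $placeholder_array = array();
        $matched_urls = array();

        // Pick a placeholder that does not occur in the text yet. strpos() returns 0
        // for a match at the very start, so compare against FALSE explicitly.
        do
        {
            $placeholder = md5(uniqid('', TRUE));
        }
        while(strpos($text, $placeholder) !== FALSE);

        foreach(array_values($exception_url) as $i => $url)
        {
            // Skip entries that are not usable strings or do not occur in the text.
            if( ! is_string($url) || $url === '' || strpos($text, $url) === FALSE) continue;

            // "." concatenates; "+" would add the values numerically. The trailing "x"
            // keeps placeholder 1 from being a prefix of placeholder 10.
            $token = $placeholder . $i . 'x';

            // Replace every occurrence, not only the first one.
            $text = str_replace($url, $token, $text);

            $placeholder_array[] = $token;
            $matched_urls[] = $url;
        }

        // Keep only the URLs that were replaced so both arrays line up on restore.
        $exception_url = $matched_urls;
    }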

$text = preg_replace("/( (?:
    (?:https?|ftp) : \\/*
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
        | (?: [0-9A-Fa-f:]+ : [0-9A-Fa-f]{1,4} )
    )
    (?: : [0-9]+ )?
    (?! [a-zA-Z0-9.:-] )
    (?:
        \\/
        [^&?#\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]*
    )?
    (?:
        [?#]
        [^\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]+
    )?
) | (?:
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
    )
    (?: : [0-9]+ )?
    (?! [a-zA-Z0-9.:-] )
    (?:
        \\/
        [^&?#\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]*
    )?
    (?:
        [?#]
        [^\\(\\)\\[\\]\\{\\}<>\\'\\\"\\x00-\\x20\\x7F-\\xFF]+
    )?
) | (?:
    [a-zA-Z0-9._-]{2,} @
    (?:
        (?: (?: [a-zA-Z0-9-]{2,} \\. )+
            (?: arpa | com | org | net | edu | gov | mil | int | [a-z]{2}
                | aero | biz | coop | info | museum | name | pro
                | example | invalid | localhost | test | local | onion | swift ) )
        | (?: [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} \\. [0-9]{1,3} )
    )
) )/Dx", '', $text);

return (empty($exception_url))? $text : str_replace($placeholder_array, $exception_url, $text);

}

Credit goes to Mihai Loga and to whoever designed this RegEx... everything starts with a good idea.
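
For comparison, a placeholder-free variant is also possible: let preg_replace_callback decide for each match whether to keep it. This is a different technique from the one used in the answers above, shown only as a rough sketch. The function name strip_urls_callback, the $pattern parameter and the demo pattern are assumptions, and the demo pattern is a simplified stand-in rather than the full regex from the question.

```php
<?php
// Hypothetical alternative: keep a matched URL when it starts with one of
// the allowed prefixes, otherwise replace it with an empty string.
function strip_urls_callback($text, $pattern, array $allowed_prefixes)
{
    return preg_replace_callback(
        $pattern,
        function ($m) use ($allowed_prefixes) {
            foreach ($allowed_prefixes as $prefix) {
                // strpos() === 0 means the match begins with the prefix.
                if (strpos($m[0], $prefix) === 0) {
                    return $m[0]; // allowed: keep the URL as it is
                }
            }
            return ''; // not allowed: strip it
        },
        $text
    );
}

// Simplified stand-in pattern (assumption): http, https and ftp URLs only.
$demo_pattern = '~(?:https?|ftp)://[^\s<>"\']+~i';

echo strip_urls_callback(
    'a http://mysite.com/x b http://spam.org/y',
    $demo_pattern,
    array('http://mysite.com/')
);
```

Prefixes should end with a slash, as in the calls above; a bare 'http://mysite.com' would also let 'http://mysite.com.evil.org' through.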