Regex to find URLs in a string wrapped in HTML

Date: 2014-03-08 05:52:41

Tags: php regex url

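There are hundreds of questions about this here on SE (and across the web). I have tried many of them, but I could not find the ultimate do-it-all regex.

Feel free to jump to the TL;DR version below...

I need to parse a string to catch all URLs.

Right now I am using this (the closest I have gotten to something that works):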

$content = preg_replace_callback( '/((http[s]?:|www[.])[^\s]*)/i', 'my_callback', $content );

The problem is that it does not catch all URLs:

    http://designscrazed.com/personal-wordpress-blog-themes/ <-- OK
    https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template <-- OK
    www.tuicool.com/articles/rqAzU3   <-- OK
    html5up.net/overflow/   <-- NOT WORKING
    http://www.tuicool.com/articles/rqAzU3    <-- OK
    http://live.btoa.com.au/spotfinder/docs/#ByVCPlik   <-- OK
    www.designrazzi.com/2013/free-css3-html5-templates/    <-- OK
    themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/   <-- NOT WORKING
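
The two misses share one cause: the pattern can only begin a match at http:, https: or www., so a bare domain gives it nowhere to start. A minimal sketch of that check on four of the URLs listed above ($pattern, $samples and $results are names chosen for this illustration):

```php
<?php
// The pattern from the question: a match can only begin at "http:", "https:" or "www.".
$pattern = '/((http[s]?:|www[.])[^\s]*)/i';

$samples = array(
    'http://designscrazed.com/personal-wordpress-blog-themes/',
    'www.tuicool.com/articles/rqAzU3',
    'html5up.net/overflow/',
    'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/',
);

$results = array();
foreach ($samples as $sample) {
    // preg_match returns 1 on a match and 0 when nothing matched.
    $results[$sample] = preg_match($pattern, $sample);
    echo $sample, ' => ', $results[$sample] ? 'matched' : 'not matched', "\n";
}
```

The first two samples match and the last two do not, which mirrors the OK / NOT WORKING marks in the list.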

I also tried it without the www part:

$content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'my_callback', $content );

and even

 $content = preg_replace_callback( '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i', 'my_callback', $content );

None of the three works for URLs that are wrapped in HTML links...

For example, in a link like
 <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

it captures the URL almost correctly, but leaves the HTML part AFTER it..

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

producing

THIS WAS CAUGHT" target="_blank">SE</a>
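
With the first pattern, [^\s]* runs until the next whitespace, so the closing quote of the href attribute ends up inside the match. One possible adjustment, offered as a sketch and not a complete fix, is to stop at quotes and angle brackets as well ($html, $greedy and $bounded are illustrative names):

```php
<?php
$html = '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

// Original pattern: stops only at whitespace, so the closing quote is included.
preg_match('/((http[s]?:|www[.])[^\s]*)/i', $html, $greedy);

// Variant: also stop at double quotes, single quotes and angle brackets.
preg_match('/((http[s]?:|www[.])[^\s"\'<>]*)/i', $html, $bounded);

echo $greedy[0], "\n";   // ends with the closing quote of the attribute
echo $bounded[0], "\n";  // the URL only
```

This variant still requires an http:, https: or www. prefix, so bare domains such as html5up.net/overflow/ remain unmatched.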

TL;DR version:

I basically need a regex that catches all URLs, in a clean way, so I can transform them:

http://www.example.com
http://example.com/
http://www.example.com/seconday/somepage#hashes?parameters
http://www.example.com/seconday/
http://www.example.com/seconday
http://example.com/seconday
http://example.com/seconday/

All of the above with http, https, or no protocol prefix (e.g. example.com/seconday).

Most importantly, all of these CAN be wrapped in HTML:

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank" some_attribute='somevalue' >SE</a>

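Edit I (after comments)

I wrote CAN because some of the URLs are also "free standing", and methods that use DOM parsing, such as DOMDocument or SimpleHTMLDOM, will fail on them because they are not inside an <a> HTML tag or have no href attribute. (As noted in the comments: think of parsing this very page with this question itself. How would DOM parsing resolve the URLs inside the <code> tags?)

1 answer:

Answer 0 (score: 0)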

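OK, so I took a stab at this and came up with the following regex. I'm sure it won't catch everything, but it does seem to catch all of the URLs you have listed on this page. Here is an example: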

// HERE IT IS LOOPING THROUGH AN ARRAY
$url_array = array('http://www.example.com', 'http://example.com/', 'http://www.example.com/seconday/somepage#hashes?parameters', 'http://www.example.com/seconday/', 'http://www.example.com/seconday', 'http://example.com/seconday', 'http://example.com/seconday/', 'http://designscrazed.com/personal-wordpress-blog-themes/', 'https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template', 'www.tuicool.com/articles/rqAzU3', 'html5up.net/overflow/', 'http://www.tuicool.com/articles/rqAzU3', 'http://live.btoa.com.au/spotfinder/docs/#ByVCPlik', 'www.designrazzi.com/2013/free-css3-html5-templates/', 'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/', '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>');

$extension_array = array('com', 'net', 'org', 'biz');

foreach ($url_array AS $url) {

    print '<br>'.$url;
    // THE EXTENSION LIST GETS ITS OWN GROUP SO THE DOT APPLIES TO EVERY EXTENSION, NOT JUST THE FIRST
    if (preg_match('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url, $m)) {
        print "<pre><font color='orange'>"; print_r($m); print "</font></pre>";
    }

}

Or here is the same thing, but using a string of text, like you would actually be using:

$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ionsipn  http://www.example.com/seconday/somepage#hashes?parameters opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday wwwe http://example.com/seconday               http://example.com/seconday/ 00000002222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeorop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;sl2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ asdf themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l  www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

$extension_array = array('com', 'net', 'org', 'biz');

if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url_string, $m)) {
    print "<pre><font color='red'>"; print_r($m); print "</font></pre>";
}

Also, if you are searching across multiple lines rather than a single line as in my example, you can add the pattern modifiers 'ms'.
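
A note on those modifiers: in PCRE, m only changes where ^ and $ match, and s only changes what an unescaped . matches. This pattern uses neither, and preg_match_all scans the whole subject, newlines included, so the results should be the same with or without them. A small check under those assumptions (the two-line $text is made up for this example, and the extension list is wrapped in its own group):

```php
<?php
$extension_array = array('com', 'net', 'org', 'biz');

// Same building blocks as the answer's pattern; the extension list is wrapped in (?:...)
// so that the escaped dot applies to every alternative.
$core = '(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:' . implode('|', $extension_array) . ')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)';

// Hypothetical two-line input.
$text = "first line http://example.com/seconday\nsecond line html5up.net/overflow/ end";

preg_match_all('~' . $core . '~i', $text, $without);   // case-insensitive only
preg_match_all('~' . $core . '~ims', $text, $with);    // with the extra modifiers

print_r($without[0]);
var_dump($without[0] === $with[0]);   // bool(true)
```

Both calls find the URL on each line, so adding ms is harmless here but not required.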

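Edit:

There was an error in my previous code: the match line called $url_string, whereas I had named the variable $urls_as_string when setting the content. If you correct the variable name, it should work as expected.

Anyway, I took the code above and modified it to work with preg_replace_callback, as you asked. This seems to work with all of the URLs you listed. Take a look: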

// CREATE THE STRING
$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ion
sipn  http://www.example.com/seconday/somepage#hashes?parameters 



opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday ww
we http://example.com/seconday               http://example.com/seconday/ 000000
02222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeo
rop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;s
l2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ as
df themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l 
 www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';


// SET SOME DOMAIN EXTENSIONS
$extension_array = array('com', 'net', 'org', 'biz');



// CHECK TO SEE IF OUR REGEX IS WORKING ... PRINT OUT ALL OF THE MATCHES
if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', $urls_as_string, $m)) {
    print_r($m);
}



// USE PREG_REPLACE_CALLBACK TO FORMAT THE URLS
$content = preg_replace_callback( '~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', 'my_callback', $urls_as_string);



// PRINT OUT THE FINISHED STRING
print "\n\n\n\nFINAL OUTPUT: \n".$content;



// THIS FUNCTION DOES A CRAPTASTIC JOB AT FORMATTING URLS
function my_callback($m) {

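    $url = $m[0];
    $href = $url;

    // ADD A SCHEME TO THE LINK TARGET WHEN THE MATCH HAS NONE
    if (!preg_match('~^http(s)?://~i', $url)) {
        $href = 'http://'.$url;
    }

    // LINK TO THE PREFIXED URL, BUT KEEP THE ORIGINAL TEXT VISIBLE
    $url_formatted = '<a href="'.$href.'">'.$url.'</a>';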

    return $url_formatted;

}

Here is a working demo of the code

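The callback function I wrote is pretty dumb, but I assume you already have a function that you will be using. This is just to demonstrate that it does what it should. Hopefully this solution solves your problem. If not, let me know and I can keep working on it.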
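
One detail to check when a list joined with implode('|', ...) is spliced into a pattern: alternation has the lowest precedence, so anything placed inside the same group as the list (such as an escaped dot) binds only to the first alternative. A short sketch of the difference ($ungrouped, $grouped and the test word are made up for this illustration):

```php
<?php
$extension_array = array('com', 'net', 'org', 'biz');
$list = implode('|', $extension_array);   // "com|net|org|biz"

// Dot inside the group: reads as "\.com" OR "net" OR "org" OR "biz".
$ungrouped = '~[-A-Z0-9.]+(?:\.' . $list . ')~i';

// Dot outside the group: a literal dot is required before every extension.
$grouped = '~[-A-Z0-9.]+\.(?:' . $list . ')~i';

$word = 'internet';   // an ordinary word, not a URL

var_dump(preg_match($ungrouped, $word));          // int(1): "inter" followed by "net"
var_dump(preg_match($grouped, $word));            // int(0): there is no dot
var_dump(preg_match($grouped, 'html5up.net'));    // int(1)
```

Keeping the dot outside the group avoids matching plain words that merely end in one of the extensions.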