Regex to linkify unlinked URLs (BBCode)

Time: 2011-10-08 06:06:20

Tags: php regex

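I need a regular expression that finds any URL that is not inside a [url(=...)]...[/url] tag. In other words, I want to linkify every unlinked URL by wrapping it as [url]link[/url], so that the parser I'm using can process it as usual.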

I've been trying to understand negative lookahead (which is apparently what I should be using), but I can't get my head around it.

Here is what I have so far:

preg_replace('/(?!\[url(=.*?)?\])(https?|ftps?|irc):\/\/(www\.)?(\w+(:\w+)?@)?[a-z0-9-]+(\.[a-z0-9-])*.*(?!\[\/url\])/i',"[url]$0[/url]",$Str);

Thanks.
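For readers following along, a minimal sketch of why the leading `(?!...)` in the attempt above cannot work: a lookahead inspects the text that follows the current position, so it never sees a tag that precedes the URL. A lookbehind `(?<!...)` inspects the text before it. The sample string and variable names are assumptions made for illustration only.

```php
<?php
$text = '[url]http://example.com[/url]';

// Negative lookahead: tested at the "h" of "http", it asks whether the text
// starting HERE is "[url]". It is not, so the assertion passes and the
// already-tagged URL is still matched.
$withLookahead = preg_match('/(?!\[url\])https?:\/\/\S+/i', $text);

// Negative lookbehind: asks whether the text just BEFORE the "h" is "[url]".
// It is, so the assertion fails and nothing matches.
$withLookbehind = preg_match('/(?<!\[url\])https?:\/\/\S+/i', $text);

var_dump($withLookahead);   // int(1)
var_dump($withLookbehind);  // int(0)
```

Note that a PCRE lookbehind must have a fixed length per alternative, so it cannot by itself cover the variable-length `[url=...]` form.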

3 answers:

Answer 0 (score: 3)

My solution:

<?php
$URLRegex = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))';     // No [url]-tag in front and is start of string, or has whitespace in front
$URLRegex.= '(';                                        // Start capturing URL
$URLRegex.= '(https?|ftps?|ircs?):\/\/';                // Protocol
$URLRegex.= '\S+';                                      // Any non-space character
$URLRegex.= ')';                                        // Stop capturing URL
$URLRegex.= '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';  // Doesn't end with punctuation (excluding /) and is end of string (with a possible dot at the end), or has whitespace after

$Str = preg_replace($URLRegex,"$2[url]$3[/url]$5",$Str);
?>
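A short usage sketch of the pattern above, to show what it does on a few inputs. The pattern is copied from the answer and assembled into one string; the wrapper name `wrap_bare_urls` and the sample strings are assumptions added for the demonstration.

```php
<?php
// Pattern from the answer above, joined into a single string.
$URLRegex = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))'
          . '((https?|ftps?|ircs?):\/\/\S+)'
          . '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';

// Hypothetical wrapper, used only for this demonstration.
function wrap_bare_urls(string $str, string $regex): string
{
    // $2 = leading whitespace, $3 = URL, $5 = trailing whitespace or dot.
    return preg_replace($regex, '$2[url]$3[/url]$5', $str);
}

echo wrap_bare_urls('Visit http://example.com/page now', $URLRegex), "\n";
// Visit [url]http://example.com/page[/url] now

echo wrap_bare_urls('See http://example.com.', $URLRegex), "\n";
// See [url]http://example.com[/url].

echo wrap_bare_urls('[url]http://example.com[/url]', $URLRegex), "\n";
// [url]http://example.com[/url]   (unchanged: no whitespace in front of the URL)

echo wrap_bare_urls('http://a.example http://b.example', $URLRegex), "\n";
// [url]http://a.example[/url] http://b.example
```

The last case shows one limitation: the pattern consumes the whitespace after a URL, so a second URL separated from the first by a single space has no leading whitespace left to match and stays unlinked.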

Answer 1 (score: 1)

There is a good URL-matching regex here:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
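A URL-matching pattern on its own does not skip URLs that are already tagged. One possible way to combine the two, sketched here under assumptions, is to match existing `[url]...[/url]` blocks first and copy them through unchanged, and only wrap whatever URL is matched outside of them. The function name `bbcode_autolink` is hypothetical, and the bare-URL part of the pattern is deliberately simplistic; it is not the regex from the linked article.

```php
<?php
/**
 * Hypothetical helper: wrap bare URLs in [url] tags while leaving existing
 * [url]...[/url] and [url=...]...[/url] blocks untouched.
 */
function bbcode_autolink(string $text): string
{
    $pattern = '~
        (\[url(?:=[^\]]*)?\].*?\[/url\])      # $1: an existing tag pair, copied through unchanged
      | (?:https?|ftps?|ircs?)://[^\s\[\]]+   # otherwise: a bare URL (simplistic on purpose)
    ~isx';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            return $m[0];                     // already linked: leave as is
        }
        // Keep trailing sentence punctuation outside the tag.
        $url  = rtrim($m[0], '.,;:!?');
        $tail = substr($m[0], strlen($url));
        return '[url]' . $url . '[/url]' . $tail;
    }, $text);
}

echo bbcode_autolink(
    'See http://example.com/a, and [url]http://example.org[/url] or [url=http://example.net]this[/url].'
), "\n";
// See [url]http://example.com/a[/url], and [url]http://example.org[/url] or [url=http://example.net]this[/url].
```

Because the tagged block is consumed as a single match, no lookbehind is needed and the variable-length `[url=...]` form is handled by the same alternative.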

Answer 2 (score: 1)

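Linkifying unlinked URLs is not trivial. There are a lot of pitfalls (see: The Problem with URLs, as well as the comment thread following that blog entry). The problem gets harder still if you have already-linked URLs that must be skipped. I have looked into this problem and have been working on a solution, an open source project: LinkifyURL. Here is the latest version of the function, which does what you need. Note that the regex is not simple (but then, neither is the problem).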

function linkify($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1  "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
    return preg_replace($url_pattern, $url_replace, $text);
}

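This solution does have some limitations, and lately I have been working on an improved version (simpler, and it works better), but it is not yet ready for prime time.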

Be sure to check out the linkify test page, which lists some really hard-to-match URLs.
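The replacement string in linkify() relies on the fact that, in `preg_replace`, references to groups from alternatives that did not match expand to empty strings, so concatenating `$1$4$7...` yields whichever delimiter actually matched. A reduced two-alternative sketch of that trick follows, emitting `[url]` tags instead of `<a>` elements as the question asks. The pattern here is a simplified assumption, not the LinkifyURL regex.

```php
<?php
// Alternative 1: URL delimited by (parentheses)   -> groups $1, $2, $3
// Alternative 2: URL delimited by [square brackets] -> groups $4, $5, $6
$pattern = '/(\()(https?:\/\/[^\s()]+)(\))|(\[)(https?:\/\/[^\s\[\]]+)(\])/i';

// Only one alternative matches at a time; the other groups contribute nothing.
$replace = '$1$4[url]$2$5[/url]$3$6';

$result = preg_replace($pattern, $replace, 'a (http://example.com) b [http://example.org] c');
echo $result, "\n";
// a ([url]http://example.com[/url]) b [[url]http://example.org[/url]] c
```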