Question

How do I write a regular expression that distinguishes between a) top-level URLs and b) the links under those top-level URLs?

For example, if the top-level URL is http://www.example.com/, the other links inside this top folder could be:
http://www.example.com/go
http://www.example.com/contact/
http://www.example.com/links/

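I don't know in advance which links exist inside the top folder. Is there a regular expression that can select the main folder as well as all of its subfolders?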

Thanks.

Answer 1

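I suggest starting with a regular expression that breaks the URL into its components. There are many examples. This one is adapted from Jan Goyvaerts, author of The Regex Cookbook: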

(?i)\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[A-Z0-9+&@#/%=~_|!:,.;]*)?

The different parts of the URL are available in the various capture groups (in the DEMO, look at "Groups" in the right-hand pane).
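As a rough illustration of reading those capture groups from code, here is a minimal Python sketch. It assumes Python's `re` module, which spells named groups `(?P<name>...)` rather than `(?<name>...)`, and it passes `re.IGNORECASE` in place of the inline `(?i)`. The helper name `split_url` and the sample URL are made up for this example.

```python
import re

# Same pattern as above, adapted to Python's named-group syntax (?P<name>...).
URL_PARTS = re.compile(
    r"\b(?P<protocol>https?|ftp)://"
    r"(?P<domain>[-A-Z0-9.]+)"
    r"(?P<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"
    r"(?P<parameters>\?[A-Z0-9+&@#/%=~_|!:,.;]*)?",
    re.IGNORECASE,
)


def split_url(text):
    """Return the named components of the first URL found in text, or None."""
    match = URL_PARTS.search(text)
    return match.groupdict() if match else None


# Hypothetical sample input; optional groups that did not match come back as None.
parts = split_url("http://www.example.com/contact/?ref=home")
print(parts)
```

A URL with no path, such as http://www.example.com, leaves the `file` and `parameters` entries set to `None`, which is one way to tell a top-level URL from a deeper link.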

Then, if you want to match fewer components, shorten the regex:

(?im)^\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)/?$

See the second demo for how this matches only URLs that have no file part.
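A minimal Python sketch of using the shortened pattern to pick the top-level URLs out of a list with one URL per line. It assumes Python's `(?P<name>...)` group syntax and passes `re.IGNORECASE | re.MULTILINE` instead of the inline flags. The sample text, including the https://example.org entry, is invented for illustration.

```python
import re

# Shortened pattern: protocol, domain, an optional trailing slash, nothing else.
# MULTILINE lets ^ and $ anchor at every line, so each line is tested on its own.
TOP_LEVEL = re.compile(
    r"^\b(?P<protocol>https?|ftp)://(?P<domain>[-A-Z0-9.]+)/?$",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical input: one URL per line.
text = (
    "http://www.example.com/\n"
    "http://www.example.com/go\n"
    "http://www.example.com/contact/\n"
    "https://example.org"
)

# Only lines with no file part after the domain survive.
top_level = [m.group(0) for m in TOP_LEVEL.finditer(text)]
print(top_level)
```

Lines that carry a path, such as http://www.example.com/go, fail at the `$` anchor and are skipped.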

Answer 2

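Since you don't want to validate the URL, just enclose the parts you need in parentheses (...) and read the matched groups at index 1 (the top-level URL) and index 2 (whatever follows the top-level URL):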

^http:\/\/([^\/]*)\/(.*)$

Here is a DEMO; click the code generator link there to get code for the language you need.

Pattern explanation:

  ^                        the beginning of the string
  http:                    'http:'
  \/                       '/'
  \/                       '/'
  (                        group and capture to \1:
    [^\/]*                   any character except: '\/' (0 or more times (Greedy))
  )                        end of \1
  \/                       '/'
  (                        group and capture to \2:
    .*                       any character except \n (0 or more times (Greedy))
  )                        end of \2
  $                        before an optional \n, and the end of the string
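The breakdown above can be tried out with a short Python sketch. The function name `host_and_path` is hypothetical, and the escaped slashes `\/` are written as plain `/`, since Python patterns have no delimiter to escape.

```python
import re

# Group 1: everything between "http://" and the next "/" (the host).
# Group 2: everything after that "/" (empty for a top-level URL).
HOST_AND_PATH = re.compile(r"^http://([^/]*)/(.*)$")


def host_and_path(url):
    """Return (host, path) for an http URL, or None if the pattern does not match."""
    match = HOST_AND_PATH.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


for url in (
    "http://www.example.com/",
    "http://www.example.com/go",
    "http://www.example.com/contact/",
):
    host, path = host_and_path(url)
    kind = "top-level" if path == "" else "link under " + host
    print(url, "->", kind)
```

Note two limits of the pattern as written: it needs a slash after the host, so http://www.example.com without the trailing slash does not match, and it only accepts the http scheme.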

If the URL sits inside a longer string, or the URLs are spread across multiple lines, use the following regex instead:

\bhttp:\/\/([^\/]*)\/([^\s]*)

DEMO
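A minimal Python sketch of the unanchored pattern pulling URLs out of running text and grouping the paths by host. The sample sentence and the variable names are assumptions made for this example.

```python
import re
from collections import defaultdict

# Unanchored variant: \b instead of ^, and [^\s]* instead of .* so the
# match stops at the first whitespace character after the path.
EMBEDDED_URL = re.compile(r"\bhttp://([^/]*)/([^\s]*)")

# Hypothetical text containing two links.
text = "See http://www.example.com/go and http://www.example.com/links/ for details."

# With two capture groups, findall returns a list of (host, path) tuples.
pairs = EMBEDDED_URL.findall(text)

# Collect every path seen under each host.
paths_by_host = defaultdict(list)
for host, path in pairs:
    paths_by_host[host].append(path)

print(dict(paths_by_host))
```

Because `[^\s]*` takes every non-whitespace character, punctuation that directly follows a URL (a closing period or bracket, say) ends up in group 2 and may need trimming.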

Regex for filtering URLs
