Question

I've been working on this regular expression for quite a while without much luck.

Basically, I want to solve the following:

Match:

http://ourwebsite.com/index.html <-- match index.html only
ourwebsite.com/index.html <-- match index.html only
ourwebsite.com/about.html#something <-- match about.html only
index.html <-- match index.html
/about.html <-- match about.html (do not match /, only about.html)
/index <-- match index
/index/ <-- match index/
index/ <-- match index/
/about <-- match about
/about/ <-- match about/
about/ <-- match about/
/about/us/ <-- match about/us/

No match:

someotherwebsite.com/index.html <-- do not match anything
someotherwebsite.com/index <-- do not match anything

In other words, match only internal site links, but with the leading / stripped.

Here is what I've built so far:

^(?:(?:https?):\/\/|\b)http\:\/\/ourwebsite\.com.*|(^[a-zA-Z0-9]*\.[a-z]{3,})

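This regex handles most of what I'm trying to do, but it still matches someotherwebsite.com.

I'm guessing my regex isn't exactly optimal either. Is there a simpler way to do this?

By the way, I'm using Python. If there is a library that can do this, I'm all for it.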

Answer 1

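Here are my assumptions -

The URLs have the form yourwebsite.com/blah, and every page contains at least yourwebsite.com or www.yourwebsite.com as literal text

So I created a list of 3 samples, depending on whether each one contains https, www, or no www -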

d = ["https://www.example.com/index.html", "www.example.com/index.html", "example.com/index.html"]

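Next, since we are only ever after the part that follows the domain, we split on example.com, because that part stays the same in every case.

Looping over all the elements of the list above, we have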

import re
for i in d:
    parts = re.split(r'example\.com/', i)  # escape the dot so it matches a literal "."
    print(parts)

which gives the following output -

['https://www.', 'index.html']
['www.', 'index.html']
['', 'index.html']

You can then always pick the second element, parts[1], for further processing.
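
One caveat with reading parts[1] directly: when the input does not contain the domain at all (for example a bare index.html), re.split returns a single-element list and parts[1] raises an IndexError. A minimal sketch of a guard, assuming a hypothetical helper name path_after_domain and the example.com domain used above:

```python
import re

def path_after_domain(url, domain="example.com"):
    """Return the text after "<domain>/", or None when the domain is absent.

    path_after_domain is a hypothetical helper, not part of any library.
    """
    # re.escape makes the dot in the domain match literally.
    parts = re.split(re.escape(domain) + "/", url, maxsplit=1)
    # Without a match, re.split returns the input as a one-element list.
    return parts[1] if len(parts) == 2 else None

print(path_after_domain("https://www.example.com/index.html"))  # index.html
print(path_after_domain("index.html"))                          # None
```

Note that this is still a substring split: it does not check that the domain starts at a host boundary, so it is only suitable when the inputs are known to follow the assumptions above.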

Answer 2

First proposal

This regex will give you the relative URL from the URLs you provided, but it will not distinguish between domains.

^(?:http://)?(?:(?:www\.)?ourwebsite\.com)?(?:/)?([a-z0-9/.]+)

Test:

https://regex101.com/r/Oi2jh8/1

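Explanation:

An optional http:// prefix as a non-capturing group
An optional www. prefix as a non-capturing group
An optional ourwebsite.com domain as a non-capturing group
An optional / domain-path separator as a non-capturing group
A capture of the path made of the characters [a-z0-9/.] (not # or ?, so the match ends there; you can extend the list with _ or -, etc.)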
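
The pieces listed above can be tried out in Python roughly as follows. This is only a sketch: the pattern is one reading of the description (with https also accepted, which is an addition), and relative_path is a hypothetical helper name.

```python
import re

# Assumed pattern built from the parts described above:
# optional scheme, optional (www.)ourwebsite.com, optional "/", then the path.
RELATIVE = re.compile(
    r"^(?:https?://)?(?:(?:www\.)?ourwebsite\.com)?/?([a-z0-9/.]+)"
)

def relative_path(url):
    """Return the captured path, or None if nothing matches (hypothetical helper)."""
    m = RELATIVE.match(url)
    return m.group(1) if m else None

for url in [
    "http://ourwebsite.com/index.html",
    "ourwebsite.com/about.html#something",
    "/about/us/",
    "someotherwebsite.com/index.html",
]:
    print(url, "->", relative_path(url))
```

The last sample shows the limitation mentioned above: because the domain part is optional, a foreign domain is simply captured as if it were part of the path.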

Second proposal

^(?:http://)?((?:www\.)?[.a-z0-9-_]+/)?([a-z0-9/.]+)

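This one also matches the domain as a capture group. If a domain is present, you get a match with 2 groups, and you can then discard the matches whose first captured group does not match ourwebsite.com:

Test: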

https://regex101.com/r/Oi2jh8/3
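
A sketch of that "capture the domain, then filter" idea in Python. The pattern here is a variant written for illustration, not the exact one from the link: it requires at least one dot in the domain so that about/us/ is not mistaken for a host, and it also accepts https. The name internal_path is hypothetical.

```python
import re

# Group 1: optional domain (must contain a dot and end with "/").
# Group 2: the path, with one optional leading "/" left out of the capture.
PATTERN = re.compile(
    r"^(?:https?://)?((?:www\.)?[a-z0-9_-]+(?:\.[a-z0-9_-]+)+/)?/?([a-z0-9/.]+)"
)

def internal_path(url, domain="ourwebsite.com"):
    """Return the path for internal links, None for other domains."""
    m = PATTERN.match(url)
    if m is None:
        return None
    host, path = m.groups()
    if host is not None:
        host = host.rstrip("/")
        if host.startswith("www."):
            host = host[len("www."):]
        if host != domain:
            return None  # a domain was captured, but it is not ours
    return path

print(internal_path("http://ourwebsite.com/index.html"))  # index.html
print(internal_path("someotherwebsite.com/index.html"))   # None
```

A scheme-less input such as index.html/ would still be read as a domain by this variant, so the heuristic only holds for inputs shaped like the samples in the question.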

Note, if you want to parse URLs in Python without using a regex:

>>> from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
        params='', query='', fragment='')

Taken from: https://docs.python.org/2/library/urlparse.html
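
Applied to the samples in the question, urlparse needs a little help, because a URL without a scheme (ourwebsite.com/index.html) is parsed with an empty netloc and the domain ends up inside path. A minimal sketch, assuming Python 3 and a hypothetical helper internal_path; treating a dotted first segment followed by / as a host is a heuristic, not something urlparse does itself:

```python
from urllib.parse import urlparse

def internal_path(url, domain="ourwebsite.com"):
    """Return the path without its leading "/", or None for other domains."""
    parsed = urlparse(url)
    host, path = parsed.hostname, parsed.path  # hostname drops any port
    if not host and not path.startswith("/"):
        # No scheme: the domain, if any, is still the first path segment.
        first, sep, rest = path.partition("/")
        # Heuristic (assumption): "name.tld/..." is a host, "index.html" is not.
        if sep and "." in first:
            host, path = first.lower(), rest
    if host:
        if host.startswith("www."):
            host = host[len("www."):]
        if host != domain:
            return None
    return path.lstrip("/")

print(internal_path("ourwebsite.com/about.html#something"))  # about.html
print(internal_path("someotherwebsite.com/index"))           # None
```

The fragment (#something) and any query string are separated out by urlparse, so they never reach the returned path.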

Match only internal domain links with a regular expression
