Question

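The following regular expression (written in Python 3) is just one part of a larger regex that will split a URL into scheme, domain, and path. This part extracts the path.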

import re

link = "http://google.com/whatever/who/jx.html"
components = re.split(r'(?<![:/])(/.*$)', link)

It returns the following:

['http://google.com', '/whatever/who/jx.html', '']

Why does the regular expression return an extra element at the end of the list?

Answer 1

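'(?<![:/])(/.*$)' matches '/whatever/who/jx.html' in your string. Your string is therefore split into the content before the match, the match itself (included because the pattern contains a capturing group), and the content after the match. Since the match runs to the end of the string, the content after it is an empty string. You get these elements (the match is shown in square brackets):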

'http://google.com'['/whatever/who/jx.html']''

This gives the final result list:

['http://google.com', '/whatever/who/jx.html', '']

Reference:
https://docs.python.org/2/library/stdtypes.html#str.split
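
As a minimal sketch of the behavior described above, assuming Python 3 and the standard re module, the following reproduces the split and then drops the empty trailing string. The names pattern and parts are illustrative, and filtering out empty strings is only one possible workaround, not necessarily the best fit for the larger regex.

```python
import re

link = "http://google.com/whatever/who/jx.html"
pattern = r'(?<![:/])(/.*$)'

# With a capturing group, re.split returns the text before the match,
# the captured text, and the text after the match.
parts = re.split(pattern, link)
print(parts)  # ['http://google.com', '/whatever/who/jx.html', '']

# The match runs to the end of the string, so the text after it is empty.
# Dropping empty strings leaves just the two components.
components = [part for part in parts if part]
print(components)  # ['http://google.com', '/whatever/who/jx.html']
```

Note that the filter would also discard an empty leading element, so it only suits cases where empty pieces carry no meaning.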

Answer 2

I think it would be better to use re.match here with a slightly different pattern:

>>> import re
>>> link = "http://google.com/whatever/who/jx.html"
>>> re.match("(https?://.+?)(/.*$)", link).groups()
('http://google.com', '/whatever/who/jx.html')
>>>

Below is a breakdown of what the regex used above matches:

(        # The start of the first capture group
http     # http
s?       # An optional s
://      # ://
.+?      # One or more characters matched non-greedily
)        # The close of the first capture group
(        # The start of the second capture group
/        # /
.*       # Zero or more characters
$        # The end of the string
)        # The close of the second capture group
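
The breakdown above can also be kept inside the pattern itself by compiling it with re.VERBOSE. The sketch below assumes Python 3; split_url and URL_PATTERN are hypothetical names introduced only for illustration. It also guards against re.match returning None, which happens when the URL has no path after the domain.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live inside it.
URL_PATTERN = re.compile(r"""
    (https?://.+?)   # scheme and domain, matched non-greedily
    (/.*$)           # path: the next slash through the end of the string
""", re.VERBOSE)


def split_url(link):
    """Hypothetical helper: return (scheme_and_domain, path), or None if no match."""
    match = URL_PATTERN.match(link)
    return match.groups() if match else None


print(split_url("http://google.com/whatever/who/jx.html"))
# ('http://google.com', '/whatever/who/jx.html')
print(split_url("http://google.com"))
# None
```

Calling .groups() directly on the result of re.match, as in the interactive session above, would raise AttributeError for the second URL, so the None check is worth keeping if such inputs are possible.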

Why does this regex split return more components than expected?
