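So, the following regular expression (written in Python 3) is only one part of a larger regex that will split a URL into scheme, domain, and path. This part extracts the path.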
import re
link = "http://google.com/whatever/who/jx.html"
components = re.split(r'(?<![:/])(/.*$)', link)
This returns the following:
['http://google.com', '/whatever/who/jx.html', '']
Why does the regex return an extra element at the end of the list?
Answer 0 (score: 1)
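'(?<![:/])(/.*$)'
matches '/whatever/who/jx.html'
in your string. Your string is therefore split into the content before the match, the match itself (included in the result because the pattern contains a capture group), and the content after the match. Since the match runs to the end of the string, the content after it is an empty string. You get these elements (the match is shown in square brackets):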
'http://google.com'['/whatever/who/jx.html']''
Hence the final result list:
['http://google.com', '/whatever/who/jx.html', '']
As specified in:
https://docs.python.org/2/library/stdtypes.html#str.split
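The behaviour described in this answer can be reproduced with a minimal sketch like the one below. It reuses the pattern and link from the question. The final filtering step is not part of the original answer; it is just one possible way to drop the empty trailing element, and the names parts and non_empty are illustrative.

```python
import re

link = "http://google.com/whatever/who/jx.html"
pattern = r'(?<![:/])(/.*$)'

# re.split returns the text before the match, the captured group,
# and the text after the match. The match extends to the end of the
# string, so the text after it is the empty string.
parts = re.split(pattern, link)
print(parts)

# Assumption: empty pieces are not wanted, so filter them out.
non_empty = [p for p in parts if p]
print(non_empty)
```

If the capture group were removed from the pattern, re.split would no longer include the matched path in the result, so the group is needed here if the path is to be kept.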
Answer 1 (score: 1)
I think it would be better to use re.match here,
with a slightly different pattern:
>>> import re
>>> link = "http://google.com/whatever/who/jx.html"
>>> re.match("(https?://.+?)(/.*$)", link).groups()
('http://google.com', '/whatever/who/jx.html')
>>>
Here is a breakdown of what the regex used above matches:
( # The start of the first capture group
http # http
s? # An optional s
:// # ://
.+? # One or more characters matched non-greedily
) # The close of the first capture group
( # The start of the second capture group
/ # /
.* # Zero or more characters
$ # The end of the string
) # The close of the second capture group
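The breakdown above maps almost directly onto Python's re.VERBOSE mode, which ignores whitespace in the pattern and allows # comments. The following is only a sketch under that assumption: the names URL_PATTERN and split_url are hypothetical helpers, not part of the original answer, and the function returns None when the link has no path after the domain.

```python
import re

# Same pattern as above, written in verbose mode so the comments
# live inside the regex itself.
URL_PATTERN = re.compile(r"""
    (           # start of the first capture group
      https?    # http followed by an optional s
      ://       # ://
      .+?       # one or more characters, matched non-greedily
    )           # close of the first capture group
    (           # start of the second capture group
      /         # /
      .*        # zero or more characters
      $         # the end of the string
    )           # close of the second capture group
""", re.VERBOSE)


def split_url(link):
    """Return (scheme_and_domain, path), or None if the pattern does not match."""
    m = URL_PATTERN.match(link)
    if m is None:
        return None
    return m.groups()


print(split_url("http://google.com/whatever/who/jx.html"))
print(split_url("http://google.com"))
```

Checking for None before calling groups() avoids an AttributeError on links that have no path component, which the one-line version would raise.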