Question

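I am trying to normalize URLs by extracting everything after the :// and before the last / at the end of the line (if that slash is present).

Here is my script: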

import re

url = "https://example.com/25194425/"
matchUrl = re.findall(r'://(.*)/?$', url)
print matchUrl

What I want is example.com/25194425, but I get example.com/25194425/. How do I deal with the trailing /? Why doesn't /? work?

Answer 1

An alternative that avoids regular expressions is to use urlparse:

>>> from urlparse import urlparse
>>> url = 'https://example.com/25194425/'
>>> '{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'example.com/25194425'

Later, if you want to include the scheme, port, parameters, ... in the normalized URL, this approach is easier to extend than updating a regex:

>>> '{url.scheme}://{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'https://example.com/25194425'
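The same idea can be sketched for Python 3, where urlparse lives in urllib.parse. The helper name normalize_url and its keep_scheme flag are hypothetical, introduced here only for illustration. Note that rstrip('/') removes every trailing slash, not just one, and that the query string is dropped because only netloc and path are used.

```python
from urllib.parse import urlparse


def normalize_url(url, keep_scheme=False):
    """Return host + path without trailing slashes (hypothetical helper)."""
    parts = urlparse(url)
    # netloc is the host (and port, if any); path still carries the trailing '/'
    normalized = "{0.netloc}{0.path}".format(parts).rstrip("/")
    if keep_scheme:
        normalized = "{0}://{1}".format(parts.scheme, normalized)
    return normalized


print(normalize_url("https://example.com/25194425/"))
print(normalize_url("https://example.com/25194425", keep_scheme=True))
```

Both calls print the URL without a trailing slash; the second one keeps the https:// prefix.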

Answer 2

As one of the commenters said, you just need to make the quantifier non-greedy:

://(.*?)/?$

However, findall() returns a list, not a string. In this case the list has only one entry, but it is still a list. To get the actual string, you need to index into it:

import re

url = "https://example.com/25194425/"
match = re.findall(r'://(.*?)/?$', url)
print match[0]

But that seems like a misuse of findall() to me. I would go with search() instead:

import re

url = "https://example.com/25194425/"
match = re.search(r'://(.*?)/?$', url)
if match:
    print match.group(1)
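For reference, here is a minimal Python 3 sketch of the search() approach wrapped in a function. The name extract is hypothetical, and returning None for input without "://" is an assumption about what the caller would want.

```python
import re

# Lazy group, optional trailing slash, anchored at the end of the string.
PATTERN = re.compile(r'://(.*?)/?$')


def extract(url):
    """Return the text after '://' minus one trailing slash, or None (hypothetical helper)."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(PATTERN.findall("https://example.com/25194425/"))  # findall gives a list
print(extract("https://example.com/25194425/"))           # search + group(1) gives a string
print(extract("https://example.com/25194425"))            # works without the slash too
print(extract("no scheme here"))                          # no match, so None
```

Checking the match object before calling group(1) is what avoids an AttributeError on input that does not match.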

Answer 3

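Matching is greedy by default, so the regex consumes as many characters as possible. That means '(.*)/' will match up to the last slash.
You can use this, provided the URL does end with a slash: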

matchUrl = re.findall(r'://(.*)/[^/]?$', url)
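To make the greedy behaviour visible, the sketch below (Python 3) puts the optional slash in a capture group of its own. That second group is an addition for illustration only and is not part of any pattern proposed in this thread.

```python
import re

url = "https://example.com/25194425/"

# Second group added only to show what the optional slash ends up matching.
greedy = re.search(r'://(.*)(/?)$', url)
lazy = re.search(r'://(.*?)(/?)$', url)

print(greedy.groups())  # ('example.com/25194425/', '')
print(lazy.groups())    # ('example.com/25194425', '/')
```

With the greedy quantifier, .* swallows the final slash and /? is satisfied by matching nothing, which is why the original pattern appears to ignore it. With the lazy quantifier, the slash is left over for /? to consume.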

Answer 4

Edit: please try the following pattern (Python 2.7.x):

import re
url1 = 'https://example.com/25194425?/'
url2 = 'https://example.com/25194425?'
print re.findall(r'https?://([\S]+)(?<!/)[/]?', url1)
print re.findall(r'https?://([\S]+)(?<!/)[/]?', url2)

Output:

['example.com/25194425?']
['example.com/25194425?']

Thanks to @Alan Moore for pointing out the boundary issue. It should now work for both cases.
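A Python 3 rendering of the same lookbehind idea might look like the sketch below. The third input, a URL embedded in a sentence, is an assumed example added to show that the pattern is not anchored to the end of the line.

```python
import re

# Greedy run of non-space characters that must not end in '/', then an optional '/'.
PATTERN = re.compile(r'https?://(\S+)(?<!/)/?')

samples = (
    'https://example.com/25194425?/',
    'https://example.com/25194425?',
    'see https://example.com/25194425/ for details',  # assumed extra case
)

for text in samples:
    print(PATTERN.findall(text))
```

The negative lookbehind (?<!/) forces the captured group to backtrack off a trailing slash, so the slash is left for the optional /? that follows.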

How do I match 0 or 1 occurrences of a character at the end of a line?
