Question

I have a regular expression that finds URLs in some text, like this:

import re

my_urlfinder = re.compile(r'\shttp:\/\/(\S+.|)blah.com/users/(\d+)(\/|)')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"

for match in my_urlfinder.findall(text):
    print(match)  # prints a tuple with the individual groups of the regex

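How do I get the whole URL? Currently each match only contains the captured parts (which I need for other things)... but I also want the complete URL.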

Answer 1

You should make your groups non-capturing:

my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')

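findall() changes behavior when the pattern contains capturing groups: with groups it returns only the groups, whereas with no capturing groups it returns the whole matched text.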
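To illustrate that behavior on a smaller case, here is a minimal sketch. The sample string and the `id=` patterns are made up for illustration and are not part of the question:

```python
import re

sample = "id=12 id=345"  # hypothetical input, not from the question

# No groups: each item is the whole match.
whole = re.findall(r'id=\d+', sample)

# One capturing group: each item is just that group.
single = re.findall(r'id=(\d+)', sample)

# Several capturing groups: each item is a tuple of the groups.
multi = re.findall(r'(id)=(\d+)', sample)

# Non-capturing groups do not count, so the whole match comes back.
non_capturing = re.findall(r'(?:id)=(?:\d+)', sample)

print(whole)          # ['id=12', 'id=345']
print(single)         # ['12', '345']
print(multi)          # [('id', '12'), ('id', '345')]
print(non_capturing)  # ['id=12', 'id=345']
```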

Demo:

>>> text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"
>>> my_urlfinder = re.compile(r'\shttp:\/\/(?:\S+.|)blah.com/users/(?:\d+)(?:\/|)')
>>> for match in my_urlfinder.findall(text):
...     print(match)
... 
 http://blah.com/users/123
 http://blah.com/users/353

Answer 2

An alternative to dropping the capturing groups is to add another group around everything:

my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')

This lets you keep the inner capturing groups while still getting the whole URL.

For the demo text, it produces these results:

('http://blah.com/users/123', '', '123', '')
('http://blah.com/users/353', '', '353', '')
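As a usage sketch, each tuple can be unpacked directly in the loop. The variable names `url`, `prefix`, `user_id` and `slash` are chosen here for illustration only:

```python
import re

my_urlfinder = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
text = "blah blah http://blah.com/users/123 blah blah http://blah.com/users/353"

found = []
# Group 1 is the outer group (the whole URL); the rest are the inner groups.
for url, prefix, user_id, slash in my_urlfinder.findall(text):
    found.append((url, user_id))
    print(url, user_id)
```

Because the outer group starts after the `\s`, the URL in the first slot does not carry the leading whitespace.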

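As a side note, be aware that the current expression requires whitespace in front of the URL, so a URL at the very start of the text will not be matched.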
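A small sketch of that limitation follows. The `relaxed` pattern is one possible workaround and is an assumption on my part, not something proposed in the answers; it accepts either the start of the string or whitespace before the URL:

```python
import re

# Hypothetical text in which the first URL has nothing in front of it.
text = "http://blah.com/users/1 and http://blah.com/users/2"

original = re.compile(r'\s(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')
# Assumed variant: (?:^|\s) is non-capturing, so group numbering is unchanged.
relaxed = re.compile(r'(?:^|\s)(http:\/\/(\S+.|)blah.com/users/(\d+)(\/|))')

original_urls = [groups[0] for groups in original.findall(text)]
relaxed_urls = [groups[0] for groups in relaxed.findall(text)]

print(original_urls)  # the URL at the start of the text is missing
print(relaxed_urls)   # both URLs are found
```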
