Question

I want to extract a complete URL from a string.

My code is:

import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print re.match(r'(ftp|http)://.*\.(jpg|png)$', data)

Output:

None

Expected output:

http://www.google.com/a.jpg

I found many related questions on StackOverflow, but none of them worked for me. I have read many posts, and this is not a duplicate. Please help! Thanks.

Answer 1

You're close!

Try this instead:

r'(ftp|http)://.*\.(jpg|png)'

A regex visualizer shows how this pattern matches.

I would also make it non-greedy, like this:

r'(ftp|http)://.*?\.(jpg|png)'

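You can compare the greedy and non-greedy behavior in a regex visualizer.

By default, .* matches as much text as possible, but here you want it to match as little as possible.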
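The difference only shows up when more than one candidate ending appears in the input. A minimal sketch, assuming a made-up string (not from the question) that contains two image URLs:

```python
import re

# Hypothetical input with two image URLs, chosen to make the difference visible.
text = "x http://a.com/1.jpg y http://b.com/2.png z"

# Greedy: .* runs to the end of the string, then backtracks to the LAST extension.
greedy = re.search(r'(ftp|http)://.*\.(jpg|png)', text)

# Non-greedy: .*? grows one character at a time and stops at the FIRST extension.
lazy = re.search(r'(ftp|http)://.*?\.(jpg|png)', text)

print(greedy.group(0))  # http://a.com/1.jpg y http://b.com/2.png
print(lazy.group(0))    # http://a.com/1.jpg
```

With the question's own string both patterns return the same match, because it contains only one extension.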

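Your $ anchors the match at the end of the line, but in your example the URL does not end where the line ends.

Another problem is that you are using re.match() rather than re.search(). re.match() only matches at the beginning of the string, whereas re.search() scans for a match anywhere in the string. See the "search() vs. match()" section of the Python re documentation for details.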
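Both problems can be checked against the string from the question. The following is a small sketch; the variable name pattern is introduced here only for illustration:

```python
import re

data = "ahahahttp://www.google.com/a.jpg>hhdhd"
pattern = r'(ftp|http)://.*?\.(jpg|png)'

# match() only tries position 0, where the text is "ahaha...", so it fails.
print(re.match(pattern, data))  # None

# search() scans forward until the pattern fits somewhere in the string.
found = re.search(pattern, data)
print(found.group(0))  # http://www.google.com/a.jpg

# Adding $ back requires the extension to sit at the very end of the string,
# which the trailing ">hhdhd" prevents.
anchored = re.search(pattern + '$', data)
print(anchored)  # None
```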

Answer 2

You should use search instead of match.

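import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# search() scans the whole string; match() would only try the very beginning.
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    # group(0) is the entire matched text, not just a capture group.
    print(url.group(0))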

Answer 3

Use find() with 'http://' or 'ftp://' to locate the start of the URL, and find() with '.jpg' or '.png' to locate its end. Then take the substring:

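data = "ahahahttp://www.google.com/a.jpg>hhdhd"
# Locate the start of the URL; find() returns -1 if 'http://' is absent.
start = data.find('http://')
kk = data[start:]
# Locate the extension within the remainder of the string.
end = kk.find('.jpg')
# Slice up to and including '.jpg' (4 characters).
print(kk[0:end + 4])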
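The snippet above checks only 'http://' and '.jpg', and it does not handle the case where find() returns -1. One way to cover both schemes and both extensions is sketched below; the helper name extract_url and its parameters are hypothetical, not part of the original answer:

```python
def extract_url(data, schemes=('http://', 'ftp://'), extensions=('.jpg', '.png')):
    """Return the first scheme-to-extension substring of data, or None."""
    # Positions of every scheme that actually occurs in the string.
    starts = [i for i in (data.find(s) for s in schemes) if i != -1]
    if not starts:
        return None
    start = min(starts)

    # End offsets (exclusive) of every extension found after the start.
    ends = []
    for ext in extensions:
        i = data.find(ext, start)
        if i != -1:
            ends.append(i + len(ext))
    if not ends:
        return None

    # Stop at the nearest extension, mirroring the non-greedy regex.
    return data[start:min(ends)]


result = extract_url("ahahahttp://www.google.com/a.jpg>hhdhd")
print(result)  # http://www.google.com/a.jpg
```

Unlike the regex, this version returns None instead of a wrong slice when no scheme or extension is present.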