Question

I think this is a common question, but I could not find a satisfying answer elsewhere.

Suppose I have extracted some links from a website. The links look like this:

http://example.com/goto/http://example1.com/123.html
http://example1.com/456.html
http://example.com/yyy/goto/http://example2.com/789.html
http://example3.com/xxx.html

I want to use a regular expression to convert them into the real links:

http://example1.com/123.html
http://example1.com/456.html
http://example2.com/789.html
http://example3.com/xxx.html

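However, because regular expressions are greedy, I cannot get this to work. 'http://.*$' simply matches the whole string. I then tried 'http://.*?$', but that did not work either, and neither did re.findall. Is there another way?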

Yes, I can do this with str.split or str.index, but I am still curious whether there is a regex solution.
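To make the failure concrete, here is a minimal sketch of the two attempts described above, run against the first sample link. The variable names are only illustrative. Both patterns return the whole string: the regex engine always starts at the leftmost position where a match is possible, and the lazy quantifier only changes how far the match extends, which the `$` anchor then forces to the end anyway.

```python
import re

link = "http://example.com/goto/http://example1.com/123.html"

# Greedy: starts at the first "http://" and runs to the end of the string.
greedy = re.findall(r"http://.*$", link)

# Lazy: still starts at the first "http://"; "$" forces it to expand to the end.
lazy = re.findall(r"http://.*?$", link)

print(greedy)  # the whole link, as a one-element list
print(lazy)    # the same one-element list
```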

Answer 1

You don't need a regex here. You can use str.split() to split each link on //, take the last part, and prepend http:// to it:

>>> s="""http://example.com/goto/http://example1.com/123.html
... http://example1.com/456.html
... http://example.com/yyy/goto/http://example2.com/789.html
... http://example3.com/xxx.html"""
>>> ['http://'+link.split('//')[-1] for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']

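With a regex, you just need to replace everything between the two // with an empty string. Because one of the two // has to be kept, use a positive look-behind for the first one: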

>>> import re
>>> [re.sub(r'(?<=//)(.*)//','',link) for link in s.split('\n')]
['http://example1.com/123.html', 'http://example1.com/456.html', 'http://example2.com/789.html', 'http://example3.com/xxx.html']
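The question also mentions str.index as a non-regex route. A minimal sketch of that idea follows, using str.rfind so that a missing prefix does not raise. The helper name `last_url` and its `scheme` parameter are hypothetical and not part of the original answer.

```python
def last_url(link: str, scheme: str = "http://") -> str:
    """Return the part of link that starts at the last occurrence of scheme."""
    start = link.rfind(scheme)
    # rfind returns -1 when scheme is absent; leave such input unchanged.
    return link if start == -1 else link[start:]


links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]
real_links = [last_url(link) for link in links]
print(real_links)
```

Unlike splitting on //, this keeps the search tied to the full prefix, so a stray // elsewhere in the path would not cut the link short.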

Answer 2

~~Use this pattern~~

~~^(.*?[^/])(?=\/[^/]).*?([^/]+)$~~

and replace with $1/$2

After reading the comments below, use this pattern to capture what you want

(http://(?:[^h]|h(?!ttp:))*)$


or this pattern

(http://(?:(?!http:).)*)$

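As a rough sketch of how the two capturing patterns above could be applied in Python: both refuse to step over another http: once a match has started, so the only match that can reach `$` is the one beginning at the last http://. The names `NEGATED_H`, `TEMPERED_DOT`, and `extract` are illustrative, and the patterns are applied to one link at a time (a multi-line string would need re.MULTILINE).

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# Consume any non-"h" character, or an "h" that does not begin "http:".
NEGATED_H = re.compile(r"(http://(?:[^h]|h(?!ttp:))*)$")
# Consume any character as long as "http:" does not start at that position.
TEMPERED_DOT = re.compile(r"(http://(?:(?!http:).)*)$")


def extract(pattern: re.Pattern, link: str) -> str:
    """Return the captured trailing URL, or the link unchanged if no match."""
    match = pattern.search(link)
    return match.group(1) if match else link


by_negated_h = [extract(NEGATED_H, link) for link in links]
by_tempered_dot = [extract(TEMPERED_DOT, link) for link in links]
print(by_negated_h)
print(by_tempered_dot)
```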

or this pattern

http://.*?(?=http://)

and replace with nothing.
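The replace-with-nothing variant might look like this in Python, assuming re.sub with an empty replacement string. The names `WRAPPER_PREFIX` and `cleaned` are illustrative. The lazy match stops right before the next http://, so only the wrapping prefix is removed, and links with no nested URL are left alone.

```python
import re

links = [
    "http://example.com/goto/http://example1.com/123.html",
    "http://example1.com/456.html",
    "http://example.com/yyy/goto/http://example2.com/789.html",
    "http://example3.com/xxx.html",
]

# Match from an "http://" up to, but not including, the next "http://".
WRAPPER_PREFIX = re.compile(r"http://.*?(?=http://)")

# Replacing every such prefix with "" leaves only the final URL.
cleaned = [WRAPPER_PREFIX.sub("", link) for link in links]
print(cleaned)
```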

How can I get the rightmost match with a regular expression?
