How to detect and remove links inside a string in Python 3

Time: 2018-10-20 10:32:32

Tags: regex python-3.x

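I have strings that may (or may not) contain links. If a link is present, it is surrounded by [link] [/link] tags. I would like to replace those parts with a special token (such as URL) and return the corresponding link.

Example

Let's assume the function detect_link does this: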

>>> sentence = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>>> replacement_token = "URL"
>>> link, new_sentence = detect_link(sentence, replacement_token)
>>> link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>>> new_sentence
'The statement URL The Washington Times'

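I searched around and found that regular expressions can do this. However, I have no experience with them. Can someone help me?

Edit: The links do not follow any constant pattern. They may or may not start with http, and they may or may not end with .com etc.

1 Answer:

Answer 0: (score: 2)

You need a regex pattern. I use http://www.regex101.com to work on regular expressions.

You can use the pattern to extract the content and to replace it, for example: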

import re

text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'

# print the text captured between the tags
for mat in re.findall(r"\[link\](.*?)\[/link\]",text):
    print(mat)

# replace each match with something else
print(re.sub(r"\[link\](.*?)\[/link\]", "[URL]", text))

Output:

http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ 

The statement [URL] The Washington Times
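The extraction and the replacement can be combined into one helper along the lines of the detect_link function the question asks for. The following is only a minimal sketch, not part of the original answer: the name detect_links, its return shape (a list of links plus the new string), the added \s* around the capture group to drop the surrounding spaces, and the example.com URL are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not from the original answer.
# The \s* on both sides of the group keeps the surrounding spaces
# out of the captured link.
LINK_RE = re.compile(r"\[link\]\s*(.*?)\s*\[/link\]")


def detect_links(text, replacement_token):
    """Return (list of links, text with each tagged link replaced by the token)."""
    links = LINK_RE.findall(text)
    # Passing a function as the replacement means backslashes in the
    # token are inserted literally instead of being read as escapes.
    new_text = LINK_RE.sub(lambda match: replacement_token, text)
    return links, new_text


links, new_sentence = detect_links(
    "The statement [link] http://example.com/a [/link] The Washington Times",
    "URL",
)
print(links)         # ['http://example.com/a']
print(new_sentence)  # The statement URL The Washington Times
```

If the string contains no [link] [/link] section, this sketch returns an empty list and the unchanged text.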

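The pattern I use is non-greedy, so if several [link] [/link] sections occur in one sentence it does not match across them; each match is the shortest possible one: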

\[link\](.*?)\[/link\]   - matches a literal [link], then captures as few characters
                           as possible before the closing tag [/link]
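The same pattern can also be written with re.VERBOSE so that each part carries its own comment. This is just an illustrative rewrite of the pattern above; the name LINK_PATTERN and the sample string are assumptions.

```python
import re

# Same pattern as above, spread over several lines; re.VERBOSE makes the
# regex engine ignore the whitespace and the comments inside the pattern.
LINK_PATTERN = re.compile(
    r"""
    \[link\]     # literal opening tag [link]
    (.*?)        # group 1: as few characters as possible (lazy)
    \[/link\]    # literal closing tag [/link]
    """,
    re.VERBOSE,
)

sample = "The statement [link] link 1 [/link] The Washington Times"
print(LINK_PATTERN.findall(sample))  # [' link 1 ']
```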

Without the non-greedy quantifier, you would get only one replacement for the whole of

The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times

instead of two.
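One way to check the number of replacements is re.subn, which returns the new string together with the replacement count. A small sketch, using the placeholders "link 1" and "link 2" instead of real URLs (an assumption made to keep the lines short):

```python
import re

text = ("The statement [link] link 1 [/link] and this also "
        "[link] link 2 [/link] The Washington Times")

# Greedy: (.*) runs to the last [/link] on the line, so there is one match.
greedy_result, greedy_count = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)
# Lazy: (.*?) stops at the first [/link], so each section is its own match.
lazy_result, lazy_count = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(greedy_count, greedy_result)
# 1 The statement [URL] The Washington Times
print(lazy_count, lazy_result)
# 2 The statement [URL] and this also [URL] The Washington Times
```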


Find all links:

import re
text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""

# collect the text captured between the tags
links = re.findall(r"\[link\](.*)\[/link\]",text)        # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]",text)  # lazy pattern

Output:

# greedy
[' link 1 [/link] and [link] link 2 ', 
 ' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']

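You can see the difference as long as the matched text does not contain newlines: (.*) does not match newline characters, so the greedy match stops at the end of each line. If a sentence contains several links, you need the lazy (.*?) to get each link as a separate match instead of one match covering the whole span.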
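The newline behaviour can be changed with the re.DOTALL flag, which lets . match newline characters as well. The sketch below is not from the original answer; the sample strings are made up for illustration.

```python
import re

lazy = r"\[link\](.*?)\[/link\]"
greedy = r"\[link\](.*)\[/link\]"

# A link whose text spans two lines is only found when . may match "\n".
wrapped = "The statement [link] first\nsecond [/link] The Washington Times"
print(re.findall(lazy, wrapped))             # []
print(re.findall(lazy, wrapped, re.DOTALL))  # [' first\nsecond ']

# With DOTALL the greedy pattern no longer stops at the end of a line.
two_lines = "[link] link 1 [/link]\n[link] link 2 [/link]"
print(re.findall(greedy, two_lines))             # [' link 1 ', ' link 2 ']
print(re.findall(greedy, two_lines, re.DOTALL))  # [' link 1 [/link]\n[link] link 2 ']
```

With re.DOTALL the lazy pattern still returns each link separately, so combining the flag with (.*?) is a reasonable choice when link text may wrap across lines.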