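I have strings that may or may not contain a link. If a link is present, it is surrounded by [link] [/link] tags. I want to replace those parts with a special token (for example URL) and also return the corresponding link.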
Example
Suppose a function detect_link does this:
>input= 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>replacement_token = "URL"
>link,new_sentence = detect_link(input,replacement_token)
>link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>new_sentence
'The statement URL The Washington Times'
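I searched a bit and found that regular expressions can do this. However, I have no experience with them. Can someone help me?
Edit: the links do not follow any constant pattern. A link may or may not start with http, and it may or may not end with .com or similar.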
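Answer 0: (score: 2)
You need a regular expression pattern. I use http://www.regex101.com to work on regular expressions.
You can use the pattern both to extract the content and to replace it, for example: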
import re
text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
# print each captured group (the text between the tags)
for mat in re.findall(r"\[link\](.*?)\[/link\]", text):
    print(mat)
# replace every match with something else
print(re.sub(r"\[link\](.*?)\[/link\]", "[URL]", text))
Output:
http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/
The statement [URL] The Washington Times
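The pattern I use is non-greedy (lazy), so when a sentence contains several [link] ... [/link] sections it matches each one separately, taking the shortest possible span, instead of one long match running from the first [link] to the last [/link]:
\[link\](.*?)\[/link\] - matches a literal [ followed by link followed by a literal ],
                         then as few characters as possible before the end tag [/link]
Without the non-greedy match, the whole of
The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times
would be replaced only once, as a single match, instead of twice.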
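To make the "once instead of twice" point concrete, here is a small sketch that counts the replacements with re.subn, which returns the new string together with the number of substitutions made. The two example.com URLs are placeholders made up for this illustration, not links from the question.

```python
import re

# Placeholder links (assumed for illustration only)
text = ("The statement [link] http://example.com/1 [/link] and this also "
        "[link] http://example.com/2 [/link] The Washington Times")

# re.subn returns a tuple: (new_string, number_of_substitutions)
greedy_result, greedy_count = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)
lazy_result, lazy_count = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(greedy_count, greedy_result)
# 1 The statement [URL] The Washington Times
print(lazy_count, lazy_result)
# 2 The statement [URL] and this also [URL] The Washington Times
```

The greedy version swallows the words between the two links as well, which is usually not what you want.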
Finding all links:
import re
text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""
# collect the text captured between the tags
links = re.findall(r"\[link\](.*)\[/link\]",text) # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]",text) # lazy pattern
Output:
# greedy
[' link 1 [/link] and [link] link 2 ',
' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']
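The difference is visible here because . does not match newline characters, so (.*) cannot run past the end of a line. Within a single line that contains several links, the greedy (.*) produces one match spanning everything from the first [link] to the last [/link]; you need (.*?) to get each link as its own match.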
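Putting the pieces together, the detect_link function from the question could be sketched roughly as follows. This is only one possible shape, under a few assumptions of mine: it returns a list of links rather than a single string (a sentence may contain several links or none), it strips the spaces just inside the tags, and it uses re.DOTALL so that a link broken across lines is still matched.

```python
import re

# Lazy pattern. The \s* on both sides keeps the padding spaces inside the
# tags out of the captured group. re.DOTALL lets . match newlines as well
# (an assumption: drop it if links never span lines).
LINK_PATTERN = re.compile(r"\[link\]\s*(.*?)\s*\[/link\]", re.DOTALL)

def detect_link(sentence, replacement_token):
    """Return (links, new_sentence) for a sentence with [link]...[/link] tags."""
    links = LINK_PATTERN.findall(sentence)
    # A function as the replacement inserts the token literally, so
    # backslashes in the token are not treated as escape sequences.
    new_sentence = LINK_PATTERN.sub(lambda _match: replacement_token, sentence)
    return links, new_sentence

sentence = ("The statement [link] http://www.washingtontimes.com/news/ [/link] "
            "The Washington Times")
links, new_sentence = detect_link(sentence, "URL")
print(links)         # ['http://www.washingtontimes.com/news/']
print(new_sentence)  # The statement URL The Washington Times
```

If a sentence has no tags at all, the function returns an empty list and the sentence unchanged.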