Question

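I have strings that may or may not contain a link. If a link is present, it is surrounded by [link] [/link] tags. I want to replace those parts with a special token (e.g. URL) and also return the corresponding link.

Example

Let's assume a function detect_link does this: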

>input= 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'
>replacement_token = "URL"
>link,new_sentence = detect_link(input,replacement_token)
>link
'http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/'
>new_sentence
'The statement URL The Washington Times'

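I searched around and found that this can be done with regular expressions. However, I have no experience with them. Can someone help me?

Edit: The link does not follow any constant pattern. It may or may not start with http. It may or may not end with .com, etc.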

Answer 1

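You need a regex pattern. I use http://www.regex101.com to experiment with regular expressions.

You can use the pattern both to extract the content and to replace it, for example: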

import re

text = 'The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times'

# print the captured text of every match
for mat in re.findall(r"\[link\](.*?)\[/link\]", text):
    print(mat)

# replace each match with something else
print(re.sub(r"\[link\](.*?)\[/link\]", "[URL]", text))

Output:

http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ 

The statement [URL] The Washington Times
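To tie this back to the question, here is a minimal sketch of what detect_link could look like. The function name and the replacement_token parameter come from the question; returning a list of links (rather than a single string) and stripping the surrounding whitespace are my own assumptions, since a sentence may contain zero or several links.

```python
import re

# Lazy pattern: capture as little as possible between the two tags.
LINK_PATTERN = re.compile(r"\[link\](.*?)\[/link\]")

def detect_link(sentence, replacement_token):
    """Return (links, new_sentence) for a sentence with [link] ... [/link] tags.

    Assumption: links is a list, because there may be zero or several links.
    """
    # Captured text includes the spaces next to the tags, so strip them.
    links = [match.strip() for match in LINK_PATTERN.findall(sentence)]
    # A function replacement inserts the token literally, so backslashes
    # in the token are not interpreted as regex escapes.
    new_sentence = LINK_PATTERN.sub(lambda _m: replacement_token, sentence)
    return links, new_sentence

sentence = ("The statement [link] http://www.washingtontimes.com/news/2017/sep/9/"
            "rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] "
            "The Washington Times")
links, new_sentence = detect_link(sentence, "URL")
print(links)
print(new_sentence)
```

If a string has no [link] tags, this sketch returns an empty list and the sentence unchanged.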

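The pattern I use is non-greedy, so if several [link] [/link] sections occur in one sentence, it does not match the whole stretch from the first opening tag to the last closing tag, only the shortest ones: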

\[link\](.*?)\[/link\]   - matches a literal [ followed by link followed by literal ]
                           with as few things before matching the endtag [/link]
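The same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only an annotated restatement of the pattern above; the sample string is made up for illustration.

```python
import re

annotated = re.compile(
    r"""
    \[link\]    # the literal opening tag [link]
    (.*?)       # capture group: as few characters as possible (lazy)
    \[/link\]   # the literal closing tag [/link]
    """,
    re.VERBOSE,
)

sample = "a [link] first [/link] b [link] second [/link] c"
found = annotated.findall(sample)
print(found)
```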

Without the non-greedy match, you would get only one replacement for the whole of

The statement [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] and this also [link] http://www.washingtontimes.com/news/2017/sep/9/rob-ranco-texas-lawyer-says-he-would-be-ok-if-bets/ [/link] The Washington Times

instead of two.
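A quick way to check this is re.subn, which returns the new string together with the number of replacements made. The shortened sentence below is a stand-in for the one above, with placeholder link text.

```python
import re

text = ("The statement [link] link 1 [/link] and this also "
        "[link] link 2 [/link] The Washington Times")

# Greedy: .* runs to the LAST [/link], so everything is one match.
greedy_result, greedy_count = re.subn(r"\[link\](.*)\[/link\]", "[URL]", text)

# Lazy: .*? stops at the FIRST [/link], so each link is its own match.
lazy_result, lazy_count = re.subn(r"\[link\](.*?)\[/link\]", "[URL]", text)

print(greedy_count, greedy_result)
print(lazy_count, lazy_result)
```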

Finding all links:

import re
text = """
The statement [link] link 1 [/link] and [link] link 2 [/link] The Washington Times
The statement [link] link 3 [/link] and [link] link 4 [/link] The Washington Times
"""

# collect the captured text of every match
links = re.findall(r"\[link\](.*)\[/link\]", text)        # greedy pattern
links_lazy = re.findall(r"\[link\](.*?)\[/link\]", text)  # lazy pattern

Output:

# greedy
[' link 1 [/link] and [link] link 2 ', 
 ' link 3 [/link] and [link] link 4 ']
# lazy
[' link 1 ', ' link 2 ', ' link 3 ', ' link 4 ']

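You can see the difference when the matched text contains no newlines. (.*) does not match newlines, so the greedy pattern stops at the end of each line. If a sentence contains multiple links, you need the lazy (.*?) to get each link as its own match instead of one match covering the whole part.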
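If a link can itself contain a line break, the default behaviour of the dot becomes a problem, and the re.DOTALL flag is one way around it. The sample text below is invented for illustration and is not from the question.

```python
import re

# The first link is split across two lines; the second is on one line.
text = "See [link] http://example.com/a\n/b [/link] and [link] link 2 [/link]"

# Default: . does not match "\n", so the first link is skipped entirely.
default_lazy = re.findall(r"\[link\](.*?)\[/link\]", text)

# re.DOTALL lets . match "\n"; the lazy quantifier still keeps links apart.
dotall_lazy = re.findall(r"\[link\](.*?)\[/link\]", text, flags=re.DOTALL)

# With re.DOTALL a greedy pattern swallows everything up to the last tag.
dotall_greedy = re.findall(r"\[link\](.*)\[/link\]", text, flags=re.DOTALL)

print(default_lazy)
print(dotall_lazy)
print(dotall_greedy)
```

With re.DOTALL the lazy quantifier matters even more, because line ends no longer stop a greedy match.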
