I'm trying to find a regex that detects URLs in text and wraps them in tags,
except for URLs that are already wrapped in <a href="url">...</a>.
input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"
This answer helped me a lot, but it does not account for URLs that are already wrapped.
    import re

    def fix_urls(text):
        # re.VERBOSE replaces the inline (?x): Python 3.11+ rejects a global
        # flag that is not at the very start of the pattern.
        pat_url = re.compile(r'''
            (                            # identify URLs within text
              (https|http|ftp|gopher)    # make sure we find a resource type
              ://                        # ...needs to be followed by colon-slash-slash
              (\w+[:.]?){2,}             # at least two domain groups, e.g. (gnosis.)(cx)
              (/?|                       # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+               # or stuff then space, newline, tab, quote
                [\w/])                   # resource name ends in alphanumeric or slash
              (?=[\s\.,>)'"\]])          # assert: followed by white or clause ending
            )                            # end of match group
            ''', re.VERBOSE)
        # findall returns one tuple per match; element 0 is the whole URL
        for url in re.findall(pat_url, text):
            text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url": url[0]})
        return text
If the text already contains <a> tags, this function wraps the URLs inside them
again, which I don't want. Do you know how to do this?
Answer 0 (score: 1)
Use a negative lookbehind to check that href=" does not come
directly before your URL (second line):
    (?x)                    # verbose
    (?<!href=\")            # don't match already inside hrefs
    (https?|ftp|gopher)     # make sure we find a resource type
    ://                     # ...needs to be followed by colon-slash-slash
    ((?:\w+[:.]?){2,})      # at least two domain groups, e.g. (gnosis.)(cx); fixed capture group
    (/?|                    # could be just the domain name (maybe w/ slash)
    [^ \n\r\"]+             # or stuff then space, newline, tab, quote
    [\w\/])                 # resource name ends in alphanumeric or slash
    (?=[\s\.,>)'\"\]])      # assert: followed by white or clause ending
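
One caveat when applying this: the lookbehind only guards the URL inside the href attribute. The link text between <a ...> and </a> is not preceded by href=", so it can still be matched. A minimal sketch of one way around that is to match whole <a>...</a> elements first and return them unchanged, wrapping only what is left over. It assumes anchors are not nested, and it uses a simplified URL pattern of my own rather than the one above. The names ANCHOR_OR_URL and link_bare_urls are made up for illustration.

```python
import re

# Alternation order matters: an existing anchor is consumed whole before the
# URL branch gets a chance to match anything inside it.
ANCHOR_OR_URL = re.compile(r'''
    (?P<anchor><a\b[^>]*>.*?</a>)      # an existing <a ...>...</a> element
  | (?P<url>
        (?:https?|ftp|gopher)://       # resource type and colon-slash-slash
        [^\s<>"']*                     # anything up to whitespace, quote or angle bracket
        [\w/]                          # must end in an alphanumeric or a slash
    )
''', re.VERBOSE | re.IGNORECASE | re.DOTALL)


def link_bare_urls(text):
    """Wrap bare URLs in <a> tags, leaving existing anchors untouched."""
    def repl(match):
        if match.group('anchor'):
            # Already linked: return it exactly as found.
            return match.group('anchor')
        url = match.group('url')
        return '<a href="%s">%s</a>' % (url, url)
    return ANCHOR_OR_URL.sub(repl, text)


if __name__ == '__main__':
    print(link_bare_urls('http://google.sk this is an url'))
    print(link_bare_urls('<a href="http://google.sk">http://google.sk</a> this is an url'))
```

Using re.sub with a replacement function also avoids the repeated text.replace calls from the question, which would rewrite every occurrence of a URL string, including ones that are already linked. For anything beyond simple input, an HTML parser is likely a safer tool than a regex.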