I have an HTML string:
I was surfing http://www.google.com, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>
and I want to convert it to this:
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span><a href="http://www.google.com">http://www.google.com</a></span>
I tried this demo.
My Python code is:
import re
p = re.compile(r'<a\b[^>]*>.*?</a>|((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"
for item in re.finditer(p, test_str):
    print(item.group(0))
Output:
>>> http://www.google.com,
>>> <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
Answer 0 (score: 1)
I hope this helps.
Code:
import re
p = re.compile(r'''[^<">]((ftp|http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^< ,"'>]''', re.MULTILINE)
test_str = "I was surfing http://www.google.com, where I found my tweet, check it out <a href=\"http://tinyurl.com/blah\">http://tinyurl.com/blah</a>"
for item in re.finditer(p, test_str):
    result = item.group(0)
    # The match includes the character before the URL (a space here); drop it.
    result = result.replace(' ', '')
    print(result)
    end_result = test_str.replace(result, '<a href="' + result + '">' + result + '</a>')
    print(end_result)
Output:
http://www.google.com
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet, check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
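Answer 1 (score: 0)
You could make the regex more complicated, but as mikus suggested, it seems easier to do the following (with p and test_str as defined in the question):
for item in re.finditer(p, test_str):
    result = item.group(0)
    # Skip matches that are already complete <a ...>...</a> elements.
    if "<a " not in result.lower():
        print(result)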
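The loop above only prints the bare URLs. A minimal sketch of carrying the same idea through to the actual replacement, using re.sub with a callback, might look like the following. The simplified URL pattern, the set of trailing punctuation that gets stripped, and the names wrap_url and link_bare_urls are my own assumptions for illustration, not part of the question's code.

```python
import re

# First alternative: an existing <a>...</a> element, matched whole so it is skipped.
# Second alternative: a bare URL, captured in group 1.
# The URL character class is a simplifying assumption, not the question's pattern.
pattern = re.compile(
    r'<a\b[^>]*>.*?</a>|((?:ftp|https?)://[^\s<>"\']+)',
    re.IGNORECASE | re.DOTALL)


def wrap_url(match):
    url = match.group(1)
    if url is None:
        # The match was an existing anchor element: return it unchanged.
        return match.group(0)
    # Assumption: punctuation at the very end belongs to the sentence, not the URL.
    stripped = url.rstrip('.,;:!?')
    trailing = url[len(stripped):]
    return '<a href="%s">%s</a>%s' % (stripped, stripped, trailing)


def link_bare_urls(html):
    return pattern.sub(wrap_url, html)


test_str = (
    'I was surfing http://www.google.com, where I found my tweet,\n'
    'check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>\n'
    '<span>http://www.google.com</span>'
)
print(link_bare_urls(test_str))
```

Because the anchor alternative comes first and consumes the whole element, the URL inside an existing link is never seen by the second alternative. This is still a regex over HTML, so it will not cope with nested or malformed markup.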
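Answer 2 (score: 0)
OK, I think I finally found what you are looking for. The basic idea is to try to match both an <a href and a URL. If there is an <a href, do nothing; if there is not, add the link. Here is the code: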
import re
test_str = """I was surfing http://www.google.com, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span>http://www.google.com</span>
"""
def repl_func(matchObj):
    href_tag, url = matchObj.groups()
    if href_tag:
        # Since it has an href tag, this isn't what we want to change,
        # so return the whole match.
        return matchObj.group(0)
    else:
        return '<a href="%s">%s</a>' % (url, url)
pattern = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)
result = re.sub(pattern, repl_func, test_str)
print(result)
Output:
I was surfing <a href="http://www.google.com">http://www.google.com</a>, where I found my tweet,
check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
<span><a href="http://www.google.com">http://www.google.com</a></span>
The main idea comes from https://stackoverflow.com/a/3580700/5100564. I also borrowed from https://stackoverflow.com/a/6718696/5100564.
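As a quick sanity check of this approach, the sketch below wraps the same pattern and replacement function in a helper and confirms that running it a second time leaves already-linked text unchanged. The helper name linkify is hypothetical, and the check relies on href being the first attribute of the tag, which holds for the sample input but is an assumption in general.

```python
import re

# Same pattern as in the answer: an optional leading <a href...> tag, then a URL.
PATTERN = re.compile(
    r'((?:<a href[^>]+>)|(?:<a href="))?'
    r'((?:https?):(?:(?://)|(?:\\\\))+'
    r"(?:[\w\d:#@%/;$()~_?\+\-=\\\.&](?:#!)?)*)",
    flags=re.IGNORECASE)


def repl_func(match_obj):
    href_tag, url = match_obj.groups()
    if href_tag:
        # Already inside a link: return the match unchanged.
        return match_obj.group(0)
    return '<a href="%s">%s</a>' % (url, url)


def linkify(html):
    """Hypothetical helper: link bare URLs, leave <a href=...> tags alone."""
    return PATTERN.sub(repl_func, html)


once = linkify('<span>http://www.google.com</span>')
twice = linkify(once)
print(once)
print(twice == once)
```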