Regex to replace HTML links with plain-text URLs

Time: 2010-02-24 00:33:17

Tags: html regex

I need to replace links in HTML:

<a href="http://example.com"></a>

with just the plain-text URL:

http://example.com

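UPD. Some clarification: I need this to strip HTML tags from text while keeping the links in place. It is purely for internal use, so there won't be any crazy edge cases. The language in this case is Python, but I don't see how that is relevant.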

3 answers:

Answer 0 (score: 2)

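As noted earlier, if you can tolerate some errors and/or have some degree of control over the input, you can trade a little completeness for simplicity and use a regex. Since your update says that is the case, this regex should work for you: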

/<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>/gi
  • $1: the href value
  • $2: everything inside the tag

This handles all of the test cases below except the last three lines:

Hello this is some text <a href="/test">This is a link</a> and this is some more text.
<a href="/test">Just a link on this line.</a>
There are <a href="/test">two links </a> on <a href="http://www.google.com">this line</a>!
Now we need to test some <a href="http://www.google.com" class="test">other attributes.</a>. They can be <a class="test" href="http://www.google.com">before</a> or after.
Or they can be <a rel="nofollow" href="http://www.google.com" class="myclass">both</a>
Also we need to deal with <a href="/test" class="myclass" style=""><span class="something">Nested tags and empty attributes</span></a>.
Make sure that we don't do anything with <a name="marker">anchors with no href</a>
Make sure we skip other <address href="/test">tags that start with a even if they are closed with an a</a>
Lastly try some other <a href="#">types</a> of <a href="">href</a> attributes.

Also we need to skip <a malformed tags.  </a>.  But <a href="#">this</a> is where regex fails us.
We will also fail if the user has used <a href='javascript:alert("the reason"))'>single quotes for some reason</a>
Other invalid HTML such as <a href="/link1" href="/link2">links with two hrefs</a> will have problems for obvious reasons.
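
Since the question mentions Python, here is a minimal sketch of how the pattern above might be applied with the re module. The pattern is copied verbatim; the names LINK_RE and links_to_urls are hypothetical, and the same limitations (malformed tags, single-quoted attributes, duplicate hrefs) are assumed to carry over.

```python
import re

# Python port of the pattern above. re.IGNORECASE mirrors the /i flag;
# re.sub replaces every match by default, which mirrors /g.
LINK_RE = re.compile(
    r'<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>',
    re.IGNORECASE,
)


def links_to_urls(html):
    """Replace each matched <a ...>...</a> with the value of its href (group 1)."""
    return LINK_RE.sub(r"\1", html)


sample = (
    'There are <a href="/test">two links </a> on '
    '<a href="http://www.google.com">this line</a>!'
)
print(links_to_urls(sample))
# prints: There are /test on http://www.google.com!
```

Anchors without an href, such as <a name="marker">, do not match and are left untouched, so a separate tag-stripping pass would still be needed for those.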

Answer 1 (score: 1)

>>> s="""blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>"""
>>> import re
>>> pat=re.compile(r'<a\s+href="(.*?)">.*?</a>',re.M|re.DOTALL|re.I)
>>> pat.findall(s)
['http://example.com', 'http://www.google.com']
>>> pat.sub("\\1",s)
'blah http://example.com blah http://www.google.com'

For anything more complex, use BeautifulSoup.
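
BeautifulSoup is a third-party package, so as a dependency-free illustration of the parser-based approach, here is a rough sketch that uses the standard library's html.parser instead (a swapped-in technique, not BeautifulSoup itself). The class and function names are hypothetical. It assumes the goal stated in the question: drop every tag, but emit the href of each link in its place.

```python
from html.parser import HTMLParser


class LinkFlattener(HTMLParser):
    """Hypothetical helper: drops all tags, emits the href of each <a href=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # output fragments, joined at the end
        self.in_link = False   # True while inside an <a> that had an href

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.parts.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # The link text is replaced by the URL; all other text is kept.
        if not self.in_link:
            self.parts.append(data)


def flatten_links(html):
    parser = LinkFlattener()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)


s = 'blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>'
print(flatten_links(s))
# prints: blah http://example.com blah http://www.google.com
```

Unlike the regex, this also copes with single-quoted attributes and with attributes placed before href, at the cost of a little more code.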

Answer 2 (score: 0)

Instead of a regex, you could try using unlink with minidom.
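
The answer gives no code, so the following is only a guess at what a minidom-based version could look like. One caveat: in xml.dom.minidom, Node.unlink() is a cleanup method that breaks a node's internal references. It does not remove hyperlinks, so the sketch does the actual substitution with replaceChild. The function names and the wrapping <root> element are assumptions, and the input must be well-formed XML (a bare & in a URL or an unclosed tag will raise a parse error).

```python
from xml.dom import minidom


def _text(node):
    """Concatenate all text nodes below `node`, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(_text(child) for child in node.childNodes)


def flatten_links_dom(fragment):
    # Wrap the fragment in a hypothetical <root> so it parses as one document.
    doc = minidom.parseString("<root>" + fragment + "</root>")
    try:
        # Copy the node list first, because the loop mutates the tree.
        for link in list(doc.getElementsByTagName("a")):
            if link.parentNode is None:
                continue  # already detached along with an enclosing link
            if link.hasAttribute("href"):
                url = doc.createTextNode(link.getAttribute("href"))
                link.parentNode.replaceChild(url, link)
                link.unlink()  # release the detached subtree
        return _text(doc.documentElement)
    finally:
        doc.unlink()  # release the rest of the DOM


s = 'blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>'
print(flatten_links_dom(s))
# prints: blah http://example.com blah http://www.google.com
```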