I need a Python regular expression to extract URLs from HTML. Sample HTML code:
<a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
<a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
<a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`
I only need to extract:
http://a0c5e.site.it/r
http://www.site.it/prodottiLLPP.php?id=1
http://www.site.it/terremoto.php
http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse
Answer 0: (score: 2)
A regular expression might solve your problem, but consider using BeautifulSoup instead:
>>> html = """<a href="http://a0c5e.site.it/r" target=_blank><font color=#808080>MailUp</font></a>
<a href="http://www.site.it/prodottiLLPP.php?id=1" class=""txtBlueGeorgia16"">Prodotti</a>
<a href="http://www.site.it/terremoto.php" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`"""
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> [e['href'] for e in soup.findAll('a')]
[u'http://a0c5e.site.it/r', u'http://www.site.it/prodottiLLPP.php?id=1', u'http://www.site.it/terremoto.php', u'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']
As suggested by Jon Clements, to match only the anchors that actually have an href attribute:
soup.findAll('a', {'href': True})
On another note, the quoting of the href attributes in your HTML snippet is incorrect.
Answer 1: (score: 1)
Observe:
Python 2.7.3 (default, Sep 4 2012, 20:19:03)
[GCC 4.2.1 20070831 patched [FreeBSD]] on freebsd9
Type "help", "copyright", "credits" or "license" for more information.
>>> junk=''' <a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
... <a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
... <a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
... <a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`'''
>>> import re
>>> pat=re.compile(r'''http[\:/a-zA-Z0-9\.\?\=&]*''')
>>> pat.findall(junk)
['http://a0c5e.site.it/r', 'http://www.site.it/prodottiLLPP.php?id=1', 'http://www.site.it/terremoto.php', 'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']
You may want to add % to the character class so that percent-escaped characters are captured as well.
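To illustrate the suggestion about %, here is a minimal sketch that compares the pattern from this answer with a variant that also accepts %. The sample link and the names ORIGINAL_PATTERN, EXTENDED_PATTERN and extract_urls are hypothetical and exist only for this illustration.

```python
import re

# Pattern from the answer above, with the needless escapes dropped.
ORIGINAL_PATTERN = re.compile(r"http[:/a-zA-Z0-9.?=&]*")
# Same pattern with % added so percent-escapes stay inside the match.
EXTENDED_PATTERN = re.compile(r"http[:/a-zA-Z0-9.?=&%]*")


def extract_urls(text, pattern=EXTENDED_PATTERN):
    """Return every substring of text that matches the URL pattern."""
    return pattern.findall(text)


# Hypothetical sample link containing an escaped space (%20).
sample = '<a href="http://www.site.it/a%20b.php">x</a>'

print(extract_urls(sample, ORIGINAL_PATTERN))  # ['http://www.site.it/a']
print(extract_urls(sample))                    # ['http://www.site.it/a%20b.php']
```

Other characters that are legal in URLs, such as -, _, ~ and #, are still missing from this character class, so a whitelist like this one will likely need further additions for real-world pages.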
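Answer 2: (score: 0)
You can use the BeautifulSoup library to manipulate and extract information from HTML.
I do not recommend using regular expressions to parse HTML. HTML is not a regular language; it is described by a context-free grammar. When the structure of a link changes, the HTML may still be valid while your regex no longer matches, and you will have to write the expression again. BeautifulSoup is a good way to extract the information.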
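As a rough illustration of why a parser copes better with structural changes, here is a minimal sketch. It swaps in Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without extra packages. The class HrefCollector, the function extract_hrefs and the sample markup are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Parse html and return the href values of its anchors, in order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample: different attribute order and quote styles,
# plus one anchor that has no href at all.
sample = '''<a href="http://www.site.it/prodottiLLPP.php?id=1" class="txtBlueGeorgia16">Prodotti</a>
<a class='mini' target='blank' href='http://www.site.it/terremoto.php'>Terremoto</a>
<a name="top">no link here</a>'''

print(extract_hrefs(sample))
# ['http://www.site.it/prodottiLLPP.php?id=1', 'http://www.site.it/terremoto.php']
```

The attribute order and the quote style differ between the two links, yet the same code handles both, and the anchor without an href is skipped.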