How do I parse only the links from a web page in Python?

Asked: 2014-11-30 04:54:31

Tags: python html regex html-parsing

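I am using the following regular expression to parse the links from a web page:

links = re.findall(r'\w+://\w+.\w+.\w+\w+\w.+"', page)

Any help would be greatly appreciated. This is what I get from parsing http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html: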

        #my current output#
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/"
        http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk +44(0)1314552759" height="100"
        http://www.rottentomatoes.com/m/sleeper/"
        http://www.rottentomatoes.com/m/sleeper/trailer/"
        http://www.rottentomatoes.com/m/star_wars/"
        http://www.rottentomatoes.com/m/star_wars/trailer/"
        http://www.rottentomatoes.com/m/wargames/"
        http://www.rottentomatoes.com/m/wargames/trailer/"
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php"> SANS to Offer "Hacking Exposed Live"
        https://www.sans.org/webcasts/archive/2013"

        #I want to get this when I run the module#
        http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
        http://www.asecuritysite.com/content/icon_clown.gif
        http://www.rottentomatoes.com/m/sleeper/
        http://www.rottentomatoes.com/m/sleeper/trailer/
        http://www.rottentomatoes.com/m/star_wars/
        http://www.rottentomatoes.com/m/star_wars/trailer/
        http://www.rottentomatoes.com/m/wargames/
        http://www.rottentomatoes.com/m/wargames/trailer/
        https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php
        https://www.sans.org/webcasts/archive/2013

3 answers:

Answer 0 (score: 1)

You should not use regular expressions for parsing HTML. There are specialized tools for this job, called HTML parsers.

Here is an example using BeautifulSoup and requests:

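from bs4 import BeautifulSoup
import requests

# Download the page, then let an HTML parser locate the links
page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute
for link in soup.find_all('a', href=True):
    print(link.get('href'))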

Prints:

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
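
If installing third-party packages is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a minimal sketch, not part of the original answer: the class name LinkCollector and the inline sample_html string are hypothetical and stand in for the downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Hypothetical sample markup standing in for the real page source
sample_html = '''
<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>
<img src="http://www.asecuritysite.com/content/icon_clown.gif" alt="clown">
<a name="top">anchor without href</a>
'''

collector = LinkCollector()
collector.feed(sample_html)

for link in collector.links:
    print(link)
```

Like the BeautifulSoup version, this only looks at <a href>. The desired output in the question also lists an image URL, so to include it the check would have to be extended to the src attribute of <img> tags as well.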

Answer 1 (score: 0)

\w+://\w+\.\w+\.\w+[^"]+

Try this. See the demo.

http://regex101.com/r/hQ9xT1/31
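
For reference, here is a small sketch of how that pattern might be dropped into the re.findall call from the question. The page string is a hypothetical sample, not the real page source.

```python
import re

# Hypothetical sample standing in for the downloaded page source
page = (
    '<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>\n'
    '<img src="http://www.asecuritysite.com/content/icon_clown.gif" '
    'alt="if broken see alex@school.ac.uk" height="100">\n'
)

# [^"]+ stops at the closing quote of the attribute value, so the
# trailing quote and any following attributes are no longer captured
links = re.findall(r'\w+://\w+\.\w+\.\w+[^"]+', page)

for link in links:
    print(link)
```

Note that the pattern requires at least two dots in the host name, so a URL such as http://example.com/ would be skipped, and it relies on every URL being terminated by a double quote.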

Answer 2 (score: 0)

Using BeautifulSoup CSS selectors:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> for i in soup.select('a[href]'):
...     print(i['href'])
...

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
..................