Extract all links from a web page

Date: 2011-05-01 11:13:09

Tags: python

I want to write a function that takes a web page URL, downloads the page, and returns a list of the URLs found in it (using the urllib module). Any help would be appreciated.

1 answer:

Answer 0 (score: 5):

Here you go:

import sys
import urllib2
import lxml.html

# Take the target URL from the command line.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

if not url.startswith("http://"):
    print "Please include the http:// at the beginning of the url"
    sys.exit(1)

# Download the page and parse it with lxml.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)

# The XPath //a/@href selects the href attribute of every <a> element.
for href in etree.xpath("//a/@href"):
    print href

C:\Programming>getlinks.py http://example.com
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
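
Since the question mentions urllib specifically, here is a minimal standard-library sketch along the same lines, written for Python 2 to match the code above. It fetches the page with urllib2, collects hrefs with HTMLParser instead of lxml, and resolves relative links such as /domains/ against the page URL with urljoin. The function name get_links is just illustrative.

import urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page URL.
                    self.links.append(urljoin(self.base_url, value))

def get_links(url):
    """Download url and return a list of absolute URLs linked from it."""
    html = urllib2.urlopen(url).read()
    parser = LinkCollector(url)
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    for link in get_links("http://example.com"):
        print link

The standard-library parser is more tolerant to install (no lxml dependency) but less forgiving of broken HTML; for real-world scraping, the lxml approach in the answer is usually the more robust choice.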