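I want to write a function that takes a web page URL, downloads the page, and returns a list of the URLs found on that page (using the urllib module). Any help would be appreciated.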
Answer 0 (score: 5)
Here you go:
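# Python 2 script; requires the third-party lxml package.
import sys
import urllib2
import lxml.html

# Take the URL to scrape from the first command-line argument.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

if not url.startswith("http://"):
    print "Please include the http:// at the beginning of the url"
    sys.exit(1)

# Download the page and parse it into an element tree.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)

# Print the href attribute of every <a> element.
for href in etree.xpath("//a/@href"):
    print href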

Example run:

C:\Programming>getlinks.py http://example.com
/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
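
Since the question asks for a function that returns a list rather than a script that prints, here is a minimal sketch of that shape. It is an assumption-laden variant, not the code above: it targets Python 3 (where urllib2 became urllib.request), swaps lxml for the standard library's html.parser, and resolves relative links against the page URL with urljoin. The names LinkCollector, extract_links and get_links are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url=""):
    """Return the href values found in html, resolved against base_url."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    # With an empty base_url, urljoin returns each href unchanged.
    return [urljoin(base_url, href) for href in parser.hrefs]


def get_links(url):
    """Download url and return the list of URLs it links to."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 when the server sends no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, base_url=url)
```

Calling get_links("http://example.com") would then return a Python list of absolute URLs instead of printing them. Keeping the parsing in extract_links separate from the download makes it possible to try the parsing on a plain string without network access.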