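I want to build a simple web scraper for fun. I need it to get a list of all the links on a page. Does the Python library have any built-in functions that would make this easier? Any pointers are appreciated.
Answer 0 (score: 7)
BeautifulSoup makes this very simple.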
from BeautifulSoup import BeautifulSoup
[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]
# [u'http://example.com/', u'/example', ...]
One last thing: you can use urlparse.urljoin to make all the URLs absolute. If you need the link text, you can use element.contents[0].
Here is how you might tie it all together:
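To illustrate what urljoin does with relative and absolute hrefs, here is a minimal sketch. It assumes Python 3, where the function lives in urllib.parse rather than urlparse; base_url and the hrefs list are made-up values for illustration only.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

# Hypothetical URL of the page the links were found on.
base_url = "http://example.com/docs/index.html"

# Relative hrefs are resolved against the page URL;
# hrefs that are already absolute come back unchanged.
hrefs = ["/example", "page2.html", "http://example.org/other"]
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```

Passing the URL of the page itself as the base is what lets path-relative links such as page2.html resolve to the right directory.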
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]
Answer 1 (score: 0)
An article has already been written about using HTMLParser to get URLs from the <a> tags on a web page.
The code looks like this:
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. attrs[0][1] is the value of
        # the first attribute, which is not always href.
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')
If you run the script, you will get output like this:
rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
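Note that the output contains values such as margin-top:1.5em and reference. Those are not links: attrs[0][1] is the value of whichever attribute happens to come first in the tag, which may be style or class rather than href. A possible variation, sketched here for Python 3 (where the module is html.parser), looks up href by name and collects the values instead of printing them. The class name LinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs, so find href by name
        # instead of assuming it is the first attribute.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


# Hypothetical sample markup; a real script would feed the downloaded page.
sample_html = (
    '<p><a class="reference" href="/about/">About</a> '
    '<a style="margin-top:1.5em" href="http://docs.python.org/">Docs</a> '
    '<a name="top">no href</a></p>'
)
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Anchors without an href attribute are skipped, so the collected list contains only actual link targets.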
You can then use a regular expression to distinguish absolute URLs from relative ones.
Answer 2 (score: 0)
A complete solution using libxml.
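One way such a regular expression might look is sketched below in Python 3. The pattern and the helper name split_links are assumptions, not part of the article: a URL is treated as absolute only if it starts with a scheme followed by "://", so fragment links, path-relative links, and forms such as mailto: all land in the relative list.

```python
import re

# Assumed rule: a scheme (letters, digits, "+", ".", "-") followed by "://".
ABSOLUTE_URL = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def split_links(links):
    """Return (absolute, relative) lists, preserving the input order."""
    absolute = [link for link in links if ABSOLUTE_URL.match(link)]
    relative = [link for link in links if not ABSOLUTE_URL.match(link)]
    return absolute, relative


# A few values taken from the output above.
links = ["/about/", "http://docs.python.org/", "#content-body", "/psf/"]
absolute, relative = split_links(links)
print(absolute)
print(relative)
```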
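import urllib
import libxml2

url = 'http://example.com/'  # page to scrape (placeholder, not in the original answer)

# Recover from malformed HTML and suppress parser errors and warnings
parse_opts = libxml2.HTML_PARSE_RECOVER + \
             libxml2.HTML_PARSE_NOERROR + \
             libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
# Select the href attribute of every <a> element with XPath
print [ i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href") ]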