Question

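I want to create a simple web scraper for fun. I need the scraper to get a list of all the links on one page. Does the Python library have any built-in functions that would make this easier? Any pointers are appreciated.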

Answer 1

It's actually very simple with BeautifulSoup.

from BeautifulSoup import BeautifulSoup

[element['href'] for element in BeautifulSoup(document_contents).findAll('a', href=True)]

# [u'http://example.com/', u'/example', ...]

One last thing: you can use urlparse.urljoin to make all of the URLs absolute. If you need the link text, you can use something like element.contents[0].
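
To illustrate the urljoin step on its own, here is a minimal sketch. It assumes Python 3, where urlparse.urljoin lives at urllib.parse.urljoin; the helper name make_absolute, the base URL, and the sample hrefs are all made up for illustration.

```python
# Python 3 sketch: urllib.parse.urljoin is the Python 3 home of urlparse.urljoin.
from urllib.parse import urljoin


def make_absolute(base_url, hrefs):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical page URL and hrefs, for illustration only.
links = make_absolute(
    "http://example.com/docs/",
    ["/example", "page.html", "http://other.example/"],
)
print(links)
```

A root-relative href such as /example is resolved against the host, while page.html is resolved against the directory of the base URL.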

Here's how you might tie it all together:

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def get_all_link_targets(url):
    return [urlparse.urljoin(url, tag['href']) for tag in
            BeautifulSoup(urllib2.urlopen(url)).findAll('a', href=True)]

Answer 2

There is an article on using HTMLParser to get URLs from <a> tags on a web page.

The code is this:

from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):

    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print "Found link => %s" % attrs[0][1]

Spider('http://www.python.org')

If you run that script, you will get output like this:

rafe@linux-7o1q:~> python crawler.py
Found link => /
Found link => #left-hand-navigation
Found link => #content-body
Found link => /search
Found link => /about/
Found link => /news/
Found link => /doc/
Found link => /download/
Found link => /community/
Found link => /psf/
Found link => /dev/
Found link => /about/help/
Found link => http://pypi.python.org/pypi
Found link => /download/releases/2.7/
Found link => http://docs.python.org/
Found link => /ftp/python/2.7/python-2.7.msi
Found link => /ftp/python/2.7/Python-2.7.tar.bz2
Found link => /download/releases/3.1.2/
Found link => http://docs.python.org/3.1/
Found link => /ftp/python/3.1.2/python-3.1.2.msi
Found link => /ftp/python/3.1.2/Python-3.1.2.tar.bz2
Found link => /community/jobs/
Found link => /community/merchandise/
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => margin-top:1.5em
Found link => color:#D58228; margin-top:1.5em
Found link => /psf/donations/
Found link => http://wiki.python.org/moin/Languages
Found link => http://wiki.python.org/moin/Languages
Found link => http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics
Found link => http://wiki.python.org/moin/Python2orPython3
Found link => http://pypi.python.org/pypi
Found link => /3kpoll
Found link => /about/success/usa/
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => reference
Found link => /about/quotes
Found link => http://wiki.python.org/moin/WebProgramming
Found link => http://wiki.python.org/moin/CgiScripts
Found link => http://www.zope.org/
Found link => http://www.djangoproject.com/
Found link => http://www.turbogears.org/
Found link => http://wiki.python.org/moin/PythonXml
Found link => http://wiki.python.org/moin/DatabaseProgramming/
Found link => http://www.egenix.com/files/python/mxODBC.html
Found link => http://sourceforge.net/projects/mysql-python
Found link => http://wiki.python.org/moin/GuiProgramming
Found link => http://wiki.python.org/moin/WxPython
Found link => http://wiki.python.org/moin/TkInter
Found link => http://wiki.python.org/moin/PyGtk
Found link => http://wiki.python.org/moin/PyQt
Found link => http://wiki.python.org/moin/NumericAndScientific
Found link => http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Found link => http://www.pentangle.net/python/handbook/
Found link => /community/sigs/current/edu-sig
Found link => http://www.openbookproject.net/pybiblio/
Found link => http://osl.iu.edu/~lums/swc/
Found link => /about/apps
Found link => http://docs.python.org/howto/sockets.html
Found link => http://twistedmatrix.com/trac/
Found link => /about/apps
Found link => http://buildbot.net/trac
Found link => http://www.edgewall.com/trac/
Found link => http://roundup.sourceforge.net/
Found link => http://wiki.python.org/moin/IntegratedDevelopmentEnvironments
Found link => /about/apps
Found link => http://www.pygame.org/news.html
Found link => http://www.alobbs.com/pykyra
Found link => http://www.vrplumber.com/py3d.py
Found link => /about/apps
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => reference external
Found link => /channews.rdf
Found link => /about/website
Found link => http://www.xs4all.com/
Found link => http://www.timparkin.co.uk/
Found link => /psf/
Found link => /about/legal
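
Note that the output above includes values such as margin-top:1.5em and reference external. That is because attrs[0][1] is the value of whichever attribute comes first in the tag, which is not always href. A possible variation, sketched here under the assumption of Python 3 (html.parser instead of HTMLParser), looks the href attribute up by name. The names LinkCollector and extract_links and the sample HTML are hypothetical, and the sketch parses a string rather than fetching a page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; pick href by name,
        # not by position.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up HTML for illustration; the third tag has no href and is skipped.
sample = (
    '<a class="reference external" href="/about/">About</a>'
    '<a style="margin-top:1.5em" href="http://docs.python.org/">Docs</a>'
    '<a name="top">no href</a>'
)
print(extract_links(sample))
```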

You can then use a regex to distinguish between absolute and relative URLs.
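
As an alternative to a regex, the split can also be sketched with the standard URL parser. This is not what the answer proposes: it swaps in urllib.parse.urlparse (Python 3) and treats an href as absolute only when it has both a scheme and a host. The function names is_absolute and split_links are made up for illustration, and under this definition things like mailto: links count as non-absolute.

```python
from urllib.parse import urlparse


def is_absolute(href):
    """Treat an href as absolute when it carries both a scheme and a host."""
    parts = urlparse(href)
    return bool(parts.scheme) and bool(parts.netloc)


def split_links(hrefs):
    """Partition hrefs into (absolute, relative) lists, preserving order."""
    absolute = [h for h in hrefs if is_absolute(h)]
    relative = [h for h in hrefs if not is_absolute(h)]
    return absolute, relative


# A few hrefs taken from the sample output above.
hrefs = [
    "/about/",
    "#content-body",
    "http://docs.python.org/",
    "/ftp/python/2.7/python-2.7.msi",
]
absolute, relative = split_links(hrefs)
print(absolute)
print(relative)
```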

Answer 3

A complete solution using libxml.

import urllib
import libxml2
parse_opts = libxml2.HTML_PARSE_RECOVER + \
            libxml2.HTML_PARSE_NOERROR + \
            libxml2.HTML_PARSE_NOWARNING

doc = libxml2.htmlReadDoc(urllib.urlopen(url).read(), '', None, parse_opts)
print [ i.getContent() for i in doc.xpathNewContext().xpathEval("//a/@href") ]

What's a simple way to extract a list of URLs on a web page using Python?
