Question

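I'm writing a very simple web crawler and am trying to parse 'robots.txt' files. I found the robotparser module in the standard library, which should do exactly this. I'm using Python 2.7.2. Unfortunately, my code won't load the 'robots.txt' files correctly, and I can't figure out why.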

Here is the relevant snippet of my code:

from urlparse import urlparse, urljoin
import robotparser

def get_all_links(page, url):
    links = []
    page_url = urlparse(url)
    base = page_url[0] + '://' + page_url[1]
    robots_url = urljoin(base, '/robots.txt')
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    for link in page.find_all('a'):
        link_url = link.get('href')
        print "Found a link: ", link_url
        if not rp.can_fetch('*', link_url):
            print "Page off limits!" 
            pass

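Here page is a parsed BeautifulSoup object and url is a URL stored as a string. The parser reads in a blank 'robots.txt' file instead of the one at the specified URL, and returns True for all can_fetch() queries. It looks as if it is either not opening the URL or failing to read the text file.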
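Separately from the loading problem, the loop above passes each href to can_fetch() exactly as it appears in the page, so relative links and anchors without an href reach the parser unchanged. The following is a minimal sketch of one way to normalise them first, not the asker's method. It uses the Python 3 module names (the 2.7 equivalents are noted in comments), and make_parser, allowed_links, the example.com URLs and the rule text are all hypothetical names and values chosen for illustration. The rules are fed in from a string so the sketch runs without network access.

```python
from urllib import robotparser        # Python 2.7: import robotparser
from urllib.parse import urljoin      # Python 2.7: from urlparse import urljoin


def make_parser(robots_text, robots_url):
    """Build a parser from robots.txt text that is already in memory."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.parse(robots_text.splitlines())
    return rp


def allowed_links(rp, page_url, hrefs, agent="*"):
    """Return the absolute URLs from hrefs that the rules permit for agent."""
    allowed = []
    for href in hrefs:
        if not href:                  # <a> tags without an href yield None
            continue
        absolute = urljoin(page_url, href)   # resolve relative links
        if rp.can_fetch(agent, absolute):
            allowed.append(absolute)
    return allowed


# Hypothetical rules and links, for illustration only.
RULES = "User-agent: *\nDisallow: /admin/\n"
rp = make_parser(RULES, "http://example.com/robots.txt")
links = allowed_links(
    rp,
    "http://example.com/blog/",
    ["post1.html", "/admin/login", None, "../about"],
)
print(links)
```

In the original function, hrefs would be the values of link.get('href') collected from page.find_all('a').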

I've also tried it in the interactive interpreter. This is what happens, using the same syntax as the documentation page.

Python 2.7.2 (default, Aug 18 2011, 18:04:39) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> url = 'http://www.udacity-forums.com/robots.txt'
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url(url)
>>> rp.read()
>>> print rp

>>>

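The line print rp should print the contents of the 'robots.txt' file, but it comes back blank. Even more frustrating, these examples both work perfectly well as written, but fail when I try my own URL. I'm fairly new to Python and I can't figure out what's going wrong. As far as I can tell, I'm using the module the same way as the documentation and the examples. Thanks for any help!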

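Update 1: Here are a few more lines from the interpreter, in case print rp is not a good way to check whether 'robots.txt' was read in. The path, host, and url attributes are correct, but the entries from 'robots.txt' have still not been read in.

>>> rp
<robotparser.RobotFileParser instance at 0x1004debd8>
>>> dir(rp)
['__doc__', '__init__', '__module__', '__str__', '_add_entry', 'allow_all', 'can_fetch', 'default_entry', 'disallow_all', 'entries', 'errcode', 'host', 'last_checked', 'modified', 'mtime', 'parse', 'path', 'read', 'set_url', 'url']
>>> rp.path
'/robots.txt'
>>> rp.host
'www.udacity-forums.com'
>>> rp.entries
[]
>>> rp.url
'http://www.udacity-forums.com/robots.txt'
>>>

Update 2: I have solved this problem by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser cannot handle certain additions to the 'robots.txt' spec, such as Sitemap, and has trouble with blank lines. It will read in files from, for example, Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I'd still appreciate it if someone smarter than me could confirm or explain this!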
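One thing that may be worth checking before concluding that nothing was parsed: the dir(rp) listing in Update 1 shows a default_entry attribute alongside entries. In the versions of the module I have looked at, rules under "User-agent: *" are stored in default_entry rather than appended to entries, so an empty entries list does not by itself show that the file was not read. The sketch below illustrates this with the Python 3 spelling of the module and a hypothetical two-line rule set. Whether this is what happened with the Udacity file is an assumption I cannot confirm.

```python
from urllib import robotparser   # Python 2.7: import robotparser

rp = robotparser.RobotFileParser()
# Hypothetical rules containing only a wildcard group.
rp.parse(["User-agent: *", "Disallow: /search"])

print(rp.entries)                     # groups for specifically named agents
print(rp.default_entry is not None)   # the "User-agent: *" group, if any
print(rp.can_fetch("MyCrawler", "http://example.com/search/results"))
```

Inspecting rp.default_entry and rp.errcode in the original 2.7.2 session would help separate "the download failed" from "the rules were stored somewhere other than entries".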

Answer 1

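I have solved this problem by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser cannot handle certain additions to the 'robots.txt' spec, such as Sitemap, and has trouble with blank lines. It will read in files from, for example, Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I'd still appreciate it if someone smarter than me could confirm or explain this!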
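The Sitemap and blank-line hypothesis can be tested offline by handing the parser text directly through parse(), which takes the network fetch out of the picture. This is a rough sketch using the Python 3 module name. The verdicts helper, the rule text and the example.com URLs are made up for illustration. It compares the answers for a plain rule set against the same rules followed by a blank line and a Sitemap line.

```python
from urllib import robotparser   # Python 2.7: import robotparser

# Hypothetical rule sets: identical rules, with and without the extras.
PLAIN = "User-agent: *\nDisallow: /private/\n"
WITH_EXTRAS = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "\n"
    "Sitemap: http://example.com/sitemap.xml\n"
)


def verdicts(robots_text, urls, agent="*"):
    """Parse robots_text from memory and return can_fetch() for each URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return [rp.can_fetch(agent, u) for u in urls]


URLS = [
    "http://example.com/private/a.html",
    "http://example.com/index.html",
]
plain_result = verdicts(PLAIN, URLS)
extras_result = verdicts(WITH_EXTRAS, URLS)
print(plain_result, extras_result)
```

If the two lists differ on a given interpreter, the parser is mishandling the extra lines. If they match, the problem more likely lies in how read() fetches the file.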

Answer 2

One solution might be to use the reppy module:

pip install reppy

Here are a few examples:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

Voilà!

Python robotparser module won't load 'robots.txt'
