Question

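I am writing a crawler, and for it I am implementing a robots.txt parser using the standard library module robotparser.

It seems that robotparser is not parsing correctly. I am debugging my crawler against Google's robots.txt.

(The following examples are from IPython.)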

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

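This is odd, because sometimes it seems to "work" and sometimes it seems to fail. I also tried the same thing with the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser module? Or am I doing something wrong here? If so, what?

I was wondering whether this bug is related in any way.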

Answer 1

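This is not a bug, but a difference in interpretation. According to the draft robots.txt specification (which was never approved, nor is it likely to be):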

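To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

(Section 3.2.2, The Allow and Disallow Lines)

Using that interpretation, "/catalogs/p?" should be rejected, because there is a "Disallow: /catalogs" directive earlier in the file.

At some point, Google started interpreting robots.txt differently from that specification. Their method appears to be: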

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.
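
To make the difference concrete, here is a minimal sketch of both interpretations as plain prefix matchers. The function names and the two-rule list are hypothetical, chosen only to mirror the directives discussed above. Neither function is robotparser's or Google's actual code, and wildcard handling is left out.

```python
def first_match(rules, path):
    """1996 draft reading: the first rule whose prefix matches decides."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched, so the URL is allowed by default


def allow_first(rules, path):
    """Reading described above: a matching Allow wins, then a matching Disallow."""
    if any(kind == "allow" and path.startswith(prefix) for kind, prefix in rules):
        return True
    if any(kind == "disallow" and path.startswith(prefix) for kind, prefix in rules):
        return False
    return True


# Hypothetical rule set (an assumption), in file order.
RULES = [("disallow", "/catalogs"), ("allow", "/catalogs/p?")]

for path in ("/catalogs", "/catalogs/p?", "/search"):
    print(path, first_match(RULES, path), allow_first(RULES, path))
```

The two readings only disagree on "/catalogs/p?": the first-match reading stops at the earlier Disallow line, while the allow-first reading lets the later Allow line win.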

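The problem is that there is no formal agreement on how robots.txt should be interpreted. I've seen crawlers that use the Google method and others that use the draft standard from 1996. When I was operating a crawler, I got nastygrams from webmasters when I used the Google interpretation, because I crawled pages they thought shouldn't be crawled. When I used the other interpretation, I got nastygrams from others, because pages they thought should be indexed weren't.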

Answer 2

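After a few Google searches, I couldn't find anything about the robotparser issue. I ended up with something else: I found a module called reppy, ran some tests on it, and it seems very powerful. You can install it through pip: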

pip install reppy

Here are some examples (in IPython) using reppy, again with Google's robots.txt:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # It also has a x.disallowed function. The contrary of x.allowed

Answer 3

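Interesting question. I had a look at the source (I only have the Python 2.4 source available, but I bet it hasn't changed), and the code normalizes the URL being tested by executing: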

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])

This is the source of your problem:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

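So it's either a bug in Python's library, or Google is breaking the robots.txt spec by including a "?" character in a rule (which is a bit unusual).

[Just in case it's not clear, I'll say it again in a different way. The code above is used by the robotparser library as part of checking the URL. So when the URL ends in a "?", that character is dropped. When you checked /catalogs/p?, the actual test executed was for /catalogs/p. Hence your surprising result.]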
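
For readers on Python 3, the same expression can be reproduced with urllib.parse. This is only a sketch of the normalization quoted above, ported by hand. The helper name normalize is hypothetical, and current versions of urllib.robotparser may normalize URLs differently.

```python
from urllib.parse import quote, unquote, urlparse


def normalize(url):
    # Index 2 of the parse result is the path component, so the query
    # string (including a bare trailing "?") is discarded here.
    return quote(urlparse(unquote(url))[2])


print(normalize("/foo"))
print(normalize("/foo?"))
print(normalize("http://www.google.com/catalogs/p?"))
```

With this normalization, "/catalogs/p?" and "/catalogs/p" become indistinguishable before any rule is consulted.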

I'd suggest filing a bug with the Python folks (you can post a link to here as part of the explanation) [edit: thanks]. And then use the other library you found...

Answer 4

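About a week ago, we merged a commit with a bug in it that was causing this issue. We've just pushed version 0.2.2 to pip and to master in the repo, including a regression test for exactly this issue.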

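Version 0.2 contains a slight interface change: you now have to create a RobotsCache object, which provides the exact interface that reppy originally had. This was mostly to make the caching explicit, and to make it possible to have different caches within the same process. But behold, it now works again!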

from reppy.cache import RobotsCache
cache = RobotsCache()
cache.allowed('http://www.google.com/catalogs', 'foo')
cache.allowed('http://www.google.com/catalogs/p', 'foo')
cache.allowed('http://www.google.com/catalogs/p?', 'foo')

Robotparser doesn't seem to parse correctly