Question

I have the following robots.txt:

User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml

and the following robotparser code:

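import robotparser  # Python 2 module; in Python 3 use urllib.robotparser
import urlparse     # Python 2 module; in Python 3 use urllib.parse

def init_robot_parser(URL):
    robot_parser = robotparser.RobotFileParser()
    # robots.txt lives at the site root, so join with an absolute path.
    robot_parser.set_url(urlparse.urljoin(URL, "/robots.txt"))
    # Fetch and parse the file.
    robot_parser.read()

    return robot_parser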

But when I print robot_parser just before the return robot_parser line, all I get is:

User-agent: *
Disallow: /images/

Why is the Sitemap line ignored? What am I missing?

Answer 1

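Sitemap is an extension to the standard, and robotparser doesn't support it. You can see in the source that it only handles "user-agent", "disallow" and "allow". For its current purpose (telling you whether a particular URL is allowed), it has no need to know about Sitemap.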
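If you only need the sitemap URLs, you can pull them out of the robots.txt text yourself. The following is a minimal sketch, not part of robotparser: the function name extract_sitemaps and the sample ROBOTS_TXT string are made up for illustration, and it assumes you have already downloaded the file body as text.

```python
def extract_sitemaps(robots_txt):
    """Return the URLs from every 'Sitemap:' line in a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, because the URL itself contains colons.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical sample matching the file in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
"""

print(extract_sitemaps(ROBOTS_TXT))
# ['http://www.example.com/sitemap.xml']
```

On Python 3.8 and later, urllib.robotparser.RobotFileParser also has a site_maps() method that returns the Sitemap entries as a list (or None if there are none), so the manual parsing is only needed on older versions.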

Answer 2

You can use Reppy (https://github.com/seomoz/reppy) to parse robots.txt; it handles sitemaps.

Keep in mind that in some cases there is a sitemap at the default location (/sitemaps.xml) that the site owner did not mention in robots.txt (for example on toucharcade.com).
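One way to cover that case is to fall back to the default location when robots.txt declares nothing. This is only a sketch under assumptions: candidate_sitemap_urls is a hypothetical helper, declared_sitemaps is whatever list of URLs you extracted from robots.txt, and the fallback path is simply the one mentioned above. It only builds the URLs to try; it does not check that they exist.

```python
from urllib.parse import urljoin

# Assumption: the fallback path is the default location named in this answer.
DEFAULT_SITEMAP_PATH = "/sitemaps.xml"


def candidate_sitemap_urls(site_url, declared_sitemaps):
    """Return sitemap URLs to try: declared ones first, else the default path."""
    if declared_sitemaps:
        return list(declared_sitemaps)
    # The leading slash makes urljoin resolve against the site root.
    return [urljoin(site_url, DEFAULT_SITEMAP_PATH)]


print(candidate_sitemap_urls("http://www.example.com/some/page", []))
# ['http://www.example.com/sitemaps.xml']
```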

I also found at least one site that compresses its sitemap; that is, the robots.txt points to a .gz file.
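A simple way to cope with compressed sitemaps is to check the downloaded bytes for the gzip magic number before parsing. A rough sketch, assuming you already have the raw response body as bytes; maybe_gunzip is a made-up name, and the sample payload is invented for the demonstration.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw):
    """Return decompressed bytes if raw is gzip data, otherwise raw unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Hypothetical payload standing in for a downloaded sitemap.
sample = b"<urlset></urlset>"
print(maybe_gunzip(gzip.compress(sample)) == sample)  # compressed input
print(maybe_gunzip(sample) == sample)                 # plain input
```

Checking the content rather than the .gz extension also covers servers that return a compressed body under a plain .xml URL.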

Python's robotparser ignores sitemaps
