Question

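Can anyone show me sample code to check whether a URL is blocked by robots.txt? A robots.txt file can specify either full URLs or directories. Is there a helper function for this in Perl?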

Answer 1

Check out WWW::RobotRules:

   The following methods are provided:

   $rules = WWW::RobotRules->new($robot_name)
  This is the constructor for WWW::RobotRules objects.  The first
  argument given to new() is the name of the robot.

   $rules->parse($robot_txt_url, $content, $fresh_until)
  The parse() method takes as arguments the URL that was used to
  retrieve the /robots.txt file, and the contents of the file.

   $rules->allowed($uri)
  Returns TRUE if this robot is allowed to retrieve this URL.
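
For illustration, here is a minimal sketch of how those three methods fit together. The robot name `MyBot/1.0`, the example.com URLs, and the inline robots.txt text are made up for the example; in practice you would download the real file first and pass its contents to parse().

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name, chosen only for this example.
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Normally this text would be fetched from the site; it is inlined here
# so the sketch runs without network access.
my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = <<'END';
User-agent: *
Disallow: /cgi-bin/
END

# Register the rules for the host that served this robots.txt.
$rules->parse($robots_url, $robots_txt);

for my $url ('http://example.com/cgi-bin/script.pl',
             'http://example.com/index.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'blocked';
    print "$url: $verdict\n";
}
```

As far as I can tell, allowed() only gives a meaningful answer for hosts whose robots.txt has already been handed to parse(), so parse one file per host before asking.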

Answer 2

WWW::RobotRules is the standard class for parsing robots.txt files and then checking whether URLs are blocked.

You may also be interested in LWP::RobotUA, which integrates that into LWP::UserAgent, automatically fetching and checking robots.txt files as needed.
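
A rough sketch of what using LWP::RobotUA might look like follows. The robot name, the contact address, and the `verdict` helper are hypothetical. The 403 check assumes the "Forbidden by robots.txt" response that LWP::RobotUA produces itself when a URL is disallowed.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical robot name and contact address; LWP::RobotUA wants both.
my $ua = LWP::RobotUA->new('MyBot/1.0', 'me@example.com');
$ua->delay(1 / 60);   # pause between requests, given in minutes (one second here)
$ua->timeout(10);     # give up on slow servers after ten seconds

# Turn a response into a short verdict. A disallowed URL is never requested;
# the user agent answers locally with a 403 that mentions robots.txt.
sub verdict {
    my ($response) = @_;
    return 'blocked by robots.txt'
        if $response->code == 403 && $response->message =~ /robots\.txt/i;
    return $response->is_success
        ? 'fetched'
        : 'failed: ' . $response->status_line;
}

# Pass URLs on the command line to try it out.
print "$_: ", verdict($ua->get($_)), "\n" for @ARGV;
```

The advantage over calling WWW::RobotRules by hand is that fetching, caching, and checking robots.txt all happen inside the normal get() call.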

Answer 3

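Load the robots.txt file and search it for "Disallow:". Then check whether the pattern that follows "Disallow:" appears in your URL. If it does, the URL is blocked by robots.txt.

Example: you find the following line in robots.txt:

Disallow: /cgi-bin/

Now strip the "Disallow:" and check whether "/cgi-bin/" (the remaining part) comes directly after the TLD, that is, at the start of the URL path.

If your URL looks like this: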

www.stackoverflow.com/cgi-bin/somwhatelse.pl

it is blocked.

If your URL looks like this:

www.stackoverflow.com/somwhatelse.pl

it is fine. You can find the complete set of rules at http://www.robotstxt.org/. This is the way to go if, for whatever reason, you cannot install additional modules.
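
The manual check described above could be sketched in plain Perl roughly like this. The helper names `disallowed_paths` and `is_blocked` are invented for the example, and the sketch deliberately ignores User-agent sections, so it treats every Disallow line as applying to your robot.

```perl
use strict;
use warnings;

# Collect the path from every "Disallow:" line.
# Simplification: User-agent sections are ignored entirely.
sub disallowed_paths {
    my ($robots_txt) = @_;
    my @paths;
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/#.*//;    # drop comments
        next unless $line =~ /^\s*Disallow\s*:\s*(\S+)/i;
        push @paths, $1;     # an empty Disallow never matches, so it is skipped
    }
    return @paths;
}

# True if the URL's path starts with one of the disallowed paths.
sub is_blocked {
    my ($url, @paths) = @_;
    # Strip the optional scheme and the host so only the path remains.
    (my $path = $url) =~ s{^(?:https?://)?[^/]+}{};
    $path = '/' unless length $path;
    for my $prefix (@paths) {
        return 1 if index($path, $prefix) == 0;
    }
    return 0;
}

my @paths = disallowed_paths("User-agent: *\nDisallow: /cgi-bin/\n");
for my $url ('www.stackoverflow.com/cgi-bin/somwhatelse.pl',
             'www.stackoverflow.com/somwhatelse.pl') {
    printf "%s is %s\n", $url, is_blocked($url, @paths) ? 'blocked' : 'ok';
}
```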

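Better still, use a module from CPAN. There is a great one for this that I use myself: LWP::RobotUA. LWP (libwww) is, in my opinion, the standard for web access in Perl; this module is part of it and makes sure your robot behaves nicely.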

Answer 4

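Hmm, you don't seem to have even looked! On the first page of search results, I see various download engines that handle robots.txt for you automatically, and at least one that does exactly what you asked.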

Answer 5

WWW::RobotRules skips "substring" rules:

User-agent: *
Disallow: *anytext*

The URL http://example.com/some_anytext.html is passed (not blocked).
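
That result is what you would expect if the Disallow value is compared as a literal path prefix, which is how the original robots.txt convention works; wildcards are a later extension. If you need them, one option is to match such rules yourself. The sketch below is an assumption-laden illustration, not part of WWW::RobotRules: `rule_matches` is a hypothetical helper that treats `*` as "any run of characters" and a trailing `$` as "end of path".

```perl
use strict;
use warnings;

# Hypothetical helper: match a URL path against one Disallow value,
# honouring the '*' and trailing '$' wildcard extension.
sub rule_matches {
    my ($pattern, $path) = @_;
    return 0 unless length $pattern;          # empty Disallow blocks nothing
    my $anchored = $pattern =~ s/\$\z//;      # trailing '$' pins the end of the path
    # Escape the literal pieces and let each '*' match any characters.
    my $regex = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    $regex .= '\z' if $anchored;
    return $path =~ /\A$regex/ ? 1 : 0;
}

for my $case (['*anytext*', '/some_anytext.html'],
              ['/cgi-bin/', '/index.html']) {
    my ($pattern, $path) = @$case;
    printf "%-12s vs %-20s => %s\n", $pattern, $path,
        rule_matches($pattern, $path) ? 'blocked' : 'allowed';
}
```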

Checking whether a URL is blocked by robots.txt using Perl