I need to crawl a web site to run some checks that determine whether its URLs are available. For this I am using crawler4j.
My problem comes from pages that have disabled robots with <meta name="robots" content="noindex,nofollow" />, because they contain content that should not be indexed by search engines.
Even though I disabled the RobotstxtServer configuration with robotstxtConfig.setEnabled(false);, crawler4j is still not following links to those pages:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...
But the pages described above are still not crawled. I have read the code, and this should be enough to disable the robots directives, but it does not work as expected. Am I missing something? I tested it with versions 3.5 and 3.6-SNAPSHOT with the same result.
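For reference, below is a minimal, self-contained sketch of this kind of setup wired into the library's stock CrawlController rather than a custom controller class. The crawler class name, user-agent name, storage folder and seed URL are placeholders, and the shouldVisit(WebURL) signature follows the 3.x API used above (in 4.x it becomes shouldVisit(Page referringPage, WebURL url)):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class AvailabilityCheckCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Placeholder filter: stay on a single host.
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // The page was fetched and handed to the crawler, so the URL is reachable.
        System.out.println("OK: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-storage"); // placeholder folder

        PageFetcher pageFetcher = new PageFetcher(config);

        // Disable robots.txt handling entirely.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("availability-checker"); // placeholder agent name
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed
        controller.start(AvailabilityCheckCrawler.class, 1);
    }
}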
Answer 0 (score: 1)
I am using the newer version:
<dependency>
<groupId>edu.uci.ics</groupId>
<artifactId>crawler4j</artifactId>
<version>4.1</version>
</dependency>
After setting up the RobotstxtConfig like this, it works:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
The test results and the crawler4j source code prove it:
public boolean allows(WebURL webURL) {
    if (config.isEnabled()) {
        try {
            URL url = new URL(webURL.getURL());
            String host = getHost(url);
            String path = url.getPath();

            HostDirectives directives = host2directivesCache.get(host);
            if ((directives != null) && directives.needsRefetch()) {
                synchronized (host2directivesCache) {
                    host2directivesCache.remove(host);
                    directives = null;
                }
            }

            if (directives == null) {
                directives = fetchDirectives(url);
            }
            return directives.allows(path);
        } catch (MalformedURLException e) {
            logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
        }
    }
    return true;
}
If enabled is set to false, the check is no longer performed at all.
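A quick way to confirm that behaviour is to call allows() directly with a disabled configuration. A small sketch (class name, storage folder and URL are placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotstxtDisabledCheck {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-storage"); // placeholder folder

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(robotstxtConfig, new PageFetcher(config));

        WebURL url = new WebURL();
        url.setURL("http://www.example.com/some-page"); // placeholder URL

        // config.isEnabled() is false, so allows() never fetches robots.txt
        // and falls straight through to "return true".
        System.out.println(robotstxtServer.allows(url)); // prints: true
    }
}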
Answer 1 (score: 0)
Why not strip everything related to robots.txt out of crawler4j altogether? I needed to crawl a site and ignore the robots rules, and this worked for me.
I changed CrawlController and WebCrawler in the .crawler package:
WebCrawler.java:
Remove:
private RobotstxtServer robotstxtServer;
Remove:
this.robotstxtServer = crawlController.getRobotstxtServer();
Modify:
if ((shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if ((shouldVisit(webURL)))
Modify:
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)))
CrawlController.java:
Remove:
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
Remove:
protected RobotstxtServer robotstxtServer;
Modify:
public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer) throws Exception
-->
public CrawlController(CrawlConfig config, PageFetcher pageFetcher) throws Exception
Remove:
this.robotstxtServer = robotstxtServer;
Modify:
if (!this.robotstxtServer.allows(webUrl))
{
    logger.info("Robots.txt does not allow this seed: " + pageUrl);
}
else
{
    this.frontier.schedule(webUrl);
}
-->
this.frontier.schedule(webUrl);
Remove:
public RobotstxtServer getRobotstxtServer()
{
    return this.robotstxtServer;
}

public void setRobotstxtServer(RobotstxtServer robotstxtServer)
{
    this.robotstxtServer = robotstxtServer;
}
Hope this is what you are looking for.
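An alternative that avoids editing the library sources, sketched under the assumption that allows() is not final in the version being used (it is a plain public method in the code quoted in Answer 0): subclass RobotstxtServer so that every URL is allowed, and pass the subclass to the controller in place of the stock one.

import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Drop-in replacement that never consults robots.txt.
public class AlwaysAllowRobotstxtServer extends RobotstxtServer {

    public AlwaysAllowRobotstxtServer(RobotstxtConfig config, PageFetcher pageFetcher) {
        super(config, pageFetcher);
    }

    @Override
    public boolean allows(WebURL webURL) {
        return true; // treat every URL as allowed
    }
}

Since the controller and the crawler only use the server through allows(), as the modifications above show, this keeps the rest of the library untouched.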