I need to crawl a web site to run some checks that determine whether its URLs are available. For this I am using crawler4j.
My problem comes from pages that have disabled robots with <meta name="robots" content="noindex,nofollow" />, because they contain content that should not be indexed by search engines.
Even though I disabled the RobotstxtServer configuration with robotstxtConfig.setEnabled(false);, crawler4j is still not following links to those pages:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...
But the pages described above are still not crawled. I have read the code, and this should be enough to disable the robots directives, but it does not work as expected. Am I missing something? I tested it with versions 3.5 and 3.6-SNAPSHOT with the same result.
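For reference, below is a minimal, self-contained sketch of this kind of setup wired into the library's stock CrawlController rather than a custom controller class. The crawler class name, user-agent name, storage folder and seed URL are placeholders, and the shouldVisit(WebURL) signature follows the 3.x API used above (in 4.x it becomes shouldVisit(Page referringPage, WebURL url)):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class AvailabilityCheckCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // Placeholder filter: stay on a single host.
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // The page was fetched and handed to the crawler, so the URL is reachable.
        System.out.println("OK: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-storage"); // placeholder folder

        PageFetcher pageFetcher = new PageFetcher(config);

        // Disable robots.txt handling entirely.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("availability-checker"); // placeholder agent name
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed
        controller.start(AvailabilityCheckCrawler.class, 1);
    }
}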
Answer 0 (score: 1)
I am using the newer version:
<dependency>
<groupId>edu.uci.ics</groupId>
<artifactId>crawler4j</artifactId>
<version>4.1</version>
</dependency>
After setting up the RobotstxtConfig like this, it works:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
The test results and the crawler4j source code prove it:
public boolean allows(WebURL webURL) {
    if (config.isEnabled()) {
        try {
            URL url = new URL(webURL.getURL());
            String host = getHost(url);
            String path = url.getPath();

            HostDirectives directives = host2directivesCache.get(host);
            if ((directives != null) && directives.needsRefetch()) {
                synchronized (host2directivesCache) {
                    host2directivesCache.remove(host);
                    directives = null;
                }
            }

            if (directives == null) {
                directives = fetchDirectives(url);
            }
            return directives.allows(path);
        } catch (MalformedURLException e) {
            logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
        }
    }
    return true;
}
If enabled is set to false, the check is no longer performed at all.
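A quick way to confirm that behaviour is to call allows() directly with a disabled configuration. A small sketch (class name, storage folder and URL are placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotstxtDisabledCheck {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-storage"); // placeholder folder

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(robotstxtConfig, new PageFetcher(config));

        WebURL url = new WebURL();
        url.setURL("http://www.example.com/some-page"); // placeholder URL

        // config.isEnabled() is false, so allows() never fetches robots.txt
        // and falls straight through to "return true".
        System.out.println(robotstxtServer.allows(url)); // prints: true
    }
}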
Answer 1 (score: 0)
Why not strip everything related to robots.txt out of crawler4j altogether? I needed to crawl a site and ignore the robots rules, and this worked for me.
I changed CrawlController and WebCrawler in the .crawler package:
WebCrawler.java:
Remove:
private RobotstxtServer robotstxtServer;
Remove:
this.robotstxtServer = crawlController.getRobotstxtServer();
Modify:
if ((shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if ((shouldVisit(webURL)))
Modify:
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)))
CrawlController.java:
Remove:
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
Remove:
protected RobotstxtServer robotstxtServer;
Modify:
public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer) throws Exception
-->
public CrawlController(CrawlConfig config, PageFetcher pageFetcher) throws Exception
Remove:
this.robotstxtServer = robotstxtServer;
Modify:
if (!this.robotstxtServer.allows(webUrl))
{
    logger.info("Robots.txt does not allow this seed: " + pageUrl);
}
else
{
    this.frontier.schedule(webUrl);
}
-->
this.frontier.schedule(webUrl);
Remove:
public RobotstxtServer getRobotstxtServer()
{
    return this.robotstxtServer;
}

public void setRobotstxtServer(RobotstxtServer robotstxtServer)
{
    this.robotstxtServer = robotstxtServer;
}
Hope this is what you are looking for.
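An alternative that avoids editing the library sources, sketched under the assumption that allows() is not final in the version being used (it is a plain public method in the code quoted in Answer 0): subclass RobotstxtServer so that every URL is allowed, and pass the subclass to the controller in place of the stock one.

import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// Drop-in replacement that never consults robots.txt.
public class AlwaysAllowRobotstxtServer extends RobotstxtServer {

    public AlwaysAllowRobotstxtServer(RobotstxtConfig config, PageFetcher pageFetcher) {
        super(config, pageFetcher);
    }

    @Override
    public boolean allows(WebURL webURL) {
        return true; // treat every URL as allowed
    }
}

Since the controller and the crawler only use the server through allows(), as the modifications above show, this keeps the rest of the library untouched.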