Crawler4j does not work with https URLs

Asked: 2014-07-09 16:05:48

Tags: grails https web-crawler crawler4j

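I am developing a Grails application using crawler4j.

I know this is an old problem, and I came across the solution here.

I tried the solution provided, but I don't know where to put the additional fetcher and MockSSL Java files.

Also, I am not sure how to invoke these two classes when a URL contains https:// ...

Thanks in advance.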

1 Answer:

Answer 0: (score: 0)

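The solution works fine. Perhaps you had trouble working out where to put the code. Here is how I use it:

When you create the crawler, your main class will contain something like this, as shown in the official documentation: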

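import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point this at a writable folder of your own.
        String crawlStorageFolder = "/data/crawl/root";

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        // The custom fetcher only swaps the "https" scheme when this flag is set
        // (assumes a crawler4j version whose CrawlConfig has setIncludeHttpsPages).
        config.setIncludeHttpsPages(true);

        /*
         * Instantiate the controller for this crawl, passing the custom
         * fetcher in place of the stock PageFetcher.
         */
        PageFetcher pageFetcher = new MockSSLSocketFactory(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        ....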

MockSSLSocketFactory is used here. Despite its name it is a PageFetcher subclass, and it is defined as shown in the link you posted:

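import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import org.apache.http.conn.scheme.Scheme;

public class MockSSLSocketFactory extends PageFetcher {

    public MockSSLSocketFactory(CrawlConfig config) {
        super(config);

        if (config.isIncludeHttpsPages()) {
            try {
                // Replace the default "https" scheme with one backed by the
                // permissive socket factory defined below.
                httpClient.getConnectionManager().getSchemeRegistry().unregister("https");
                httpClient.getConnectionManager().getSchemeRegistry()
                        .register(new Scheme("https", 443, new SimpleSSLSocketFactory()));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}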
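To make the unregister/register pair above easier to follow, here is a minimal stand-alone model of the idea: a registry that maps a scheme name to whatever handles connections for it. It is only a sketch. `SchemeRegistryModel` and `Handler` are hypothetical names made up for illustration, and a plain `Map` stands in for HttpClient's real scheme registry.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemeRegistryModel {

    /** Hypothetical stand-in for a per-scheme connection handler. */
    static final class Handler {
        final String description;
        final int defaultPort;

        Handler(String description, int defaultPort) {
            this.description = description;
            this.defaultPort = defaultPort;
        }
    }

    private final Map<String, Handler> handlers = new HashMap<>();

    /** Binds a handler to a scheme and returns the one it replaced, if any. */
    Handler register(String scheme, Handler handler) {
        return handlers.put(scheme, handler);
    }

    /** Removes the handler bound to a scheme and returns it, if any. */
    Handler unregister(String scheme) {
        return handlers.remove(scheme);
    }

    /** Looks up the handler currently bound to a scheme, or null. */
    Handler get(String scheme) {
        return handlers.get(scheme);
    }

    public static void main(String[] args) {
        SchemeRegistryModel registry = new SchemeRegistryModel();
        registry.register("http", new Handler("plain sockets", 80));
        registry.register("https", new Handler("strict TLS", 443));

        // Same two steps as in the constructor above: drop the old
        // "https" binding, then bind the permissive handler on port 443.
        registry.unregister("https");
        registry.register("https", new Handler("trust-all TLS", 443));

        System.out.println(registry.get("https").description);
        System.out.println(registry.get("http").description);
    }
}
```

The point of the model is that the swap happens once, when the fetcher is constructed. Nothing has to be called per URL afterwards: every https request made through that fetcher goes through the replacement handler.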

As you can see, the SimpleSSLSocketFactory class is used here. It can be defined as shown in the linked example:

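import java.io.IOException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.UnrecoverableKeyException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.conn.ssl.X509HostnameVerifier;

// Accepts every certificate and every host name, so TLS validation is
// effectively disabled for connections made through this factory.
public class SimpleSSLSocketFactory extends SSLSocketFactory {

    public SimpleSSLSocketFactory() throws NoSuchAlgorithmException, KeyManagementException,
            KeyStoreException, UnrecoverableKeyException {
        super(trustStrategy, hostnameVerifier);
    }

    // Hostname verifier whose checks never fail.
    private static final X509HostnameVerifier hostnameVerifier = new X509HostnameVerifier() {
        @Override
        public void verify(String host, SSLSocket ssl) throws IOException {
            // Do nothing
        }

        @Override
        public void verify(String host, String[] cns, String[] subjectAlts) throws SSLException {
            // Do nothing
        }

        @Override
        public boolean verify(String s, SSLSession sslSession) {
            return true;
        }

        @Override
        public void verify(String host, X509Certificate cert) throws SSLException {
            // Do nothing
        }
    };

    // Trust strategy that treats every certificate chain as trusted.
    private static final TrustStrategy trustStrategy = new TrustStrategy() {
        @Override
        public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
            return true;
        }
    };
}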
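For comparison, the same "accept everything" idea can be sketched with only the JDK's `javax.net.ssl` classes, without the Apache HttpClient types used above. This is an illustrative sketch, not part of crawler4j or of the linked solution. The class name `TrustAllSketch` and its members are hypothetical, and like the class above it disables certificate and hostname checks, so it is only reasonable where that risk is acceptable.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSketch {

    /** Trust manager that accepts every certificate chain (insecure). */
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Accept everything.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Accept everything.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    /** Hostname verifier that accepts every host name (insecure). */
    static final HostnameVerifier ALLOW_ALL_HOSTS = (hostname, session) -> true;

    /** Builds a TLS context whose only trust manager is TRUST_ALL. */
    static SSLContext trustAllContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, null);
            return context;
        } catch (GeneralSecurityException e) {
            // "TLS" is a standard algorithm name, so this is not expected.
            throw new IllegalStateException("Could not create TLS context", e);
        }
    }

    public static void main(String[] args) {
        SSLContext context = trustAllContext();
        System.out.println(context.getProtocol());
    }
}
```

`TRUST_ALL` plays the role of the `TrustStrategy` above and `ALLOW_ALL_HOSTS` the role of the `X509HostnameVerifier`. A context built this way could be handed to any client that accepts an `SSLContext`.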

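As you can see, I simply copied the code from the official documentation and from the link you posted, but I hope seeing it all in one place makes things clearer for you.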