Error when scraping a web page with HtmlUnit 2.18

Time: 2015-10-31 02:33:57

Tags: java web-scraping web-crawler htmlunit

I have the following code:

WebClient webClient = new WebClient(BrowserVersion.getDefault());
HtmlPage page;
List<HtmlAnchor> anchor = new ArrayList<HtmlAnchor>();

try {
    System.out.println("Querying");
    page = webClient.getPage("https://www.amazon.com/gp/goldbox");
    anchor = page.getAnchors();
    // Print the href of every link found on the page
    for (HtmlAnchor s : anchor) {
        System.out.println(s.getAttribute("href"));
    }
    System.out.println("Success");
} catch (IOException e) {
    // getPage declares IOException; a try block needs a catch or finally to compile
    e.printStackTrace();
}

Querying

Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
    at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
    at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
    at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
    at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:82)
    at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:95)
    at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:104)
    at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:62)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:572)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:174)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:158)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:125)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.createConnectionManager(HttpWebConnection.java:972)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:161)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1321)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1238)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:346)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:415)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:400)
    at crawler.HtmlUnitCrawl.main(HtmlUnitCrawl.java:29)

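What could be causing this error?

2 answers:

Answer 0: (score: 0)

You have a CLASSPATH conflict; your code itself works fine.

Remove all other HttpComponents .jar files and use only the ones that ship with HtmlUnit.

You can also check which version is actually being loaded with the following: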

    Class<?> klass = DefaultHttpRequestWriterFactory.class;
    String location = klass.getProtectionDomain().getCodeSource().getLocation().toString();
    System.out.println(location);

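In your case this should print the location of httpcore-4.4.1.jar. If it prints a different jar, that jar is the one being picked up ahead of the version HtmlUnit expects.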
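The check above can be wrapped in a small reusable helper. This is only a sketch: the class name `ClassLocation` and the method `locate` are made up for illustration, and it uses nothing beyond the JDK. It loads the class without initializing it, since in this question the static initializer of `DefaultHttpRequestWriterFactory` is exactly what throws the `NoSuchFieldError`.

```java
import java.security.CodeSource;

public class ClassLocation {

    /** Returns where the given class was loaded from, or a placeholder if unknown. */
    static String locate(Class<?> klass) {
        CodeSource source = klass.getProtectionDomain().getCodeSource();
        // JDK core classes usually have no code source, so guard against null.
        if (source == null || source.getLocation() == null) {
            return "(no code source; probably a JDK class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Hypothetical usage: pass a fully qualified class name as the first argument.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        // initialize=false: load the class but do not run its static initializer,
        // which may itself fail when incompatible jars are mixed.
        Class<?> klass = Class.forName(name, false, ClassLocation.class.getClassLoader());
        System.out.println(name + " -> " + locate(klass));
    }
}
```

Run with the HtmlUnit jars on the classpath and `org.apache.http.impl.io.DefaultHttpRequestWriterFactory` as the argument, it should print the jar that class is actually loaded from.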

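Answer 1: (score: 0)

I checked whether HtmlUnit depends on a library version that my project was already using. I then switched to an HtmlUnit version compatible with the version in my project, and everything worked.

Httpclient-4.2.1 conflicted with HtmlUnit-2.21 (which uses httpclient-4.5.2.jar). So I changed to HtmlUnit 2.10 (which uses Httpclient-4.2.1), and it works fine.

Check which libraries in your project conflict.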
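One way to look for such a conflict, sketched here under the assumption that the clashing jars are all visible to the same class loader, is to ask that loader for every copy of a given class file. More than one result suggests two jars provide the same class. The names `DuplicateClassFinder`, `toResourcePath`, and `findCopies` are hypothetical; only standard JDK calls are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class DuplicateClassFinder {

    /** Converts a fully qualified class name into its class-file resource path. */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Lists every location on the classpath that provides the given class. */
    static List<URL> findCopies(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the class from the stack trace in the question.
        String name = args.length > 0
                ? args[0]
                : "org.apache.http.impl.io.DefaultHttpRequestWriterFactory";
        List<URL> copies = findCopies(name);
        System.out.println(copies.size() + " copy/copies of " + name);
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

If two different httpcore or httpclient jars show up in the output, removing or aligning one of them is the fix both answers describe.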