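Update: I ended up using ghost.py, but I would still appreciate an answer.
I have been using plain Java / Apache httpd and NIO to scrape pages recently, but I ran into a problem that I expected to be simple and apparently is not. I am trying to use HtmlUnit to scrape a page, but every time I run the code below I get the error shown after it, which suggests that a jar is missing. Unfortunately I could not find an answer here, because this problem has a strange part.
So, here is the strange part. I do have the jar (lang3), it is up to date, and it contains a method StringUtils.startsWithIgnoreCase(String string, String prefix) that works. I would really like to avoid Selenium, because I need to scrape (if my sampling is telling me the truth) about 1000 pages on the same site over a few months.
Do I need a specific version? All I have found are notes saying to update to 3-1. Is there a way to check whether the installation is valid?
Thanks.
The code I am running is: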
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.RefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
public class crawl {
    public crawl()
    {
        //TODO Constructor
        crawl_page();
    }

    public void crawl_page()
    {
        //TODO control the crawling
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_10);
        // Log refresh requests instead of following them
        webClient.setRefreshHandler(new RefreshHandler() {
            public void handleRefresh(Page page, URL url, int arg) throws IOException {
                System.out.println("handleRefresh");
            }
        });
        //the url for CA's Megan's law sex off
        String url = "http://www.myurl.com"; //not my url
        HtmlPage page;
        try {
            page = (HtmlPage) webClient.getPage(url);
            // Tick the agreement checkbox and submit the form
            HtmlForm form = page.getFormByName("_ctl0");
            form.getInputByName("cbAgree").setChecked(true);
            page = form.getButtonByName("Continue").click();
            System.out.println(page.asText());
        } catch (FailingHttpStatusCodeException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (MalformedURLException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
The error is:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.lang3.StringUtils.startsWithIgnoreCase(Ljava/lang/CharSequence;Ljava/lang/CharSequence;)Z
at com.gargoylesoftware.htmlunit.util.URLCreator$URLCreatorStandard.toUrlUnsafeClassic(URLCreator.java:66)
at com.gargoylesoftware.htmlunit.util.UrlUtils.toUrlUnsafe(UrlUtils.java:193)
at com.gargoylesoftware.htmlunit.util.UrlUtils.toUrlSafe(UrlUtils.java:171)
at com.gargoylesoftware.htmlunit.WebClient.<clinit>(WebClient.java:159)
at ca__soc.crawl.crawl_page(crawl.java:34)
at ca__soc.crawl.<init>(crawl.java:24)
at ca__soc.us_ca_ca_soc.main(us_ca_ca_soc.java:17)
Answer 0 (score: 1)
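The Javadoc for the method says: "Since: 2.4, 3.0 Changed signature from startsWithIgnoreCase(String, String) to startsWithIgnoreCase(CharSequence, CharSequence)".
The NoSuchMethodError shows that HtmlUnit is calling the (CharSequence, CharSequence) version, while the StringUtils class loaded at run time only has the (String, String) version you describe. So you probably have two similar jars on your classpath, and the older one is being picked up first.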
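To test the duplicate-jar theory, you can ask the JVM which copies of StringUtils it can see and which signature the loaded one has. Below is a minimal sketch using only standard JDK reflection and class loader calls. The class name ClasspathCheck and its helper methods are hypothetical names made up for this example; they are not part of HtmlUnit or Commons Lang. It assumes you run it with the same classpath as the crawler.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for diagnosing duplicate jars; not part of any library.
class ClasspathCheck {

    /** Lists every copy of the class file that the system class loader can see. */
    static List<URL> copiesOf(String className) {
        String resource = className.replace('.', '/') + ".class";
        try {
            return Collections.list(ClassLoader.getSystemClassLoader().getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reports where a loaded class came from; JDK classes have no code source. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(JDK or unknown)";
        }
        return source.getLocation().toString();
    }

    /** True if the class has a public method with exactly these parameter types. */
    static boolean hasMethod(Class<?> type, String name, Class<?>... parameterTypes) {
        try {
            type.getMethod(name, parameterTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.lang3.StringUtils";

        // More than one line printed here means more than one jar provides the class.
        for (URL copy : copiesOf(name)) {
            System.out.println("found: " + copy);
        }

        try {
            Class<?> utils = Class.forName(name);
            System.out.println("loaded from: " + locationOf(utils));
            System.out.println("has (CharSequence, CharSequence): "
                    + hasMethod(utils, "startsWithIgnoreCase", CharSequence.class, CharSequence.class));
            System.out.println("has (String, String): "
                    + hasMethod(utils, "startsWithIgnoreCase", String.class, String.class));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

If the "loaded from" line points at a jar other than the one you expected, or the CharSequence check prints false, removing or reordering that jar on the classpath is the likely fix.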