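I am using HtmlUnit to collect news comments. (http://v.media.daum.net/v/20170111104708176)
The title and body are collected successfully.
However, because the comment section is loaded asynchronously, I cannot collect the comments with HtmlUnit.
When I run a network trace in Chrome, the page calls an API internally, but that call does not seem to take effect in HtmlUnit. (http://comment.daum.net/apis/v1/posts/17542278/comments?parentId=0&offset=3&limit=10&sort=RECOMMEND)
Here is my code.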
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlHeading3;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSection;

public class Test {
    public static void main(String[] args) throws IOException {
        // Emulate Chrome with JavaScript on and CSS off
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);
        // Do not abort on HTTP error codes or script errors
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getCookieManager().setCookiesEnabled(true);
        // Turn asynchronous AJAX calls into synchronous ones where possible
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        String seedURL = "http://v.media.daum.net/v/20170111104708176";
        HtmlPage page = webClient.getPage(seedURL);
        // Give background JavaScript up to 10 seconds to finish
        webClient.waitForBackgroundJavaScript(10 * 1000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10 * 1000);

        // News title
        String title = ((HtmlHeading3) page.getByXPath("//*[@id='cSub']/div[1]/h3").get(0)).getTextContent();
        // News body
        String content = ((HtmlSection) page.getByXPath("//*[@id='harmonyContainer']/section").get(0)).getTextContent();
        System.out.println(title);
        System.out.println(content);

        // Comment container
        HtmlDivision reply = (HtmlDivision) page.getByXPath("//*[@id='alex-area']").get(0);
        // Prints the container, but it has no child elements
        System.out.println(reply.asXml());
    }
}
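For reference, the request I saw in the Chrome network trace can be broken down into its parts as below. This is only a sketch that rebuilds the URL string. The class and method names are made up for illustration, and I am assuming (without confirmation) that 17542278 is the post id for this article and that the query parameters mean what their names suggest. It does not perform the request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: rebuilds the comment API URL observed in Chrome.
final class CommentUrlSketch {
    // Base of the endpoint as it appeared in the network trace.
    private static final String BASE = "http://comment.daum.net/apis/v1/posts/";

    // Parameter meanings are assumptions based on their names only.
    // Values are assumed to be URL-safe, so no encoding is applied.
    static String buildCommentsUrl(long postId, int parentId, int offset, int limit, String sort) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("parentId", String.valueOf(parentId));
        params.put("offset", String.valueOf(offset));
        params.put("limit", String.valueOf(limit));
        params.put("sort", sort);

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return BASE + postId + "/comments?" + query;
    }

    public static void main(String[] args) {
        // Same values as in the traced request.
        System.out.println(buildCommentsUrl(17542278L, 0, 3, 10, "RECOMMEND"));
    }
}
```

With the values from the trace, this produces the same URL as the one quoted above. I do not know whether the endpoint expects additional headers or cookies.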
Please help.
P.S. This is the HtmlUnit Maven dependency.
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.20</version>
</dependency>