Collecting news comments from an asynchronously loaded page with HtmlUnit

Time: 2017-01-12 08:45:58

Tags: web-crawler htmlunit

I am using HtmlUnit to collect news comments from this article: http://v.media.daum.net/v/20170111104708176

The title and the article body are collected successfully.

However, the comment section is loaded asynchronously, and I cannot collect the comments with HtmlUnit.

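When I trace the network traffic in Chrome, I can see the page calling the following API internally, but the same call does not seem to happen (or does not produce any comments) in HtmlUnit: http://comment.daum.net/apis/v1/posts/17542278/comments?parentId=0&offset=3&limit=10&sort=RECOMMEND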

Here is my code.

import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlHeading3;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSection;

public class Test {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the client (and its windows) when done
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Run the page's JavaScript, skip CSS, and tolerate SSL, HTTP and script errors
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getCookieManager().setCookiesEnabled(true);
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());

            String seedURL = "http://v.media.daum.net/v/20170111104708176";

            // Load the article, then give background JavaScript up to 10 seconds to finish
            HtmlPage page = webClient.getPage(seedURL);
            webClient.waitForBackgroundJavaScript(10 * 1000);
            webClient.waitForBackgroundJavaScriptStartingBefore(10 * 1000);

            // News title
            String title = ((HtmlHeading3) page.getByXPath("//*[@id='cSub']/div[1]/h3").get(0)).getTextContent();
            // News body
            String content = ((HtmlSection) page.getByXPath("//*[@id='harmonyContainer']/section").get(0)).getTextContent();
            System.out.println(title);
            System.out.println(content);

            // Comment container: this is where the comments should appear
            HtmlDivision reply = (HtmlDivision) page.getByXPath("//*[@id='alex-area']").get(0);
            System.out.println(reply.asXml()); // Problem: the div has no child elements
        }
    }

}
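Since the comment container is filled in asynchronously, one thing I could try is re-checking it at intervals instead of waiting once. The sketch below is only a generic polling helper using the Java standard library; `PollSketch` and `waitUntil` are hypothetical names, and the stand-in condition in `main` just becomes true after about 300 ms. With HtmlUnit, the condition might be something like `reply.getChildElementCount() > 0`, possibly with a short `webClient.waitForBackgroundJavaScript(...)` call between checks. That is an assumption, and it would not help if HtmlUnit cannot run the page's comment script at all.

```java
import java.util.function.BooleanSupplier;

public class PollSketch {

    /**
     * Re-checks a condition at a fixed interval until it holds or the timeout expires.
     * Returns true if the condition held before the deadline, false otherwise
     * (including when the waiting thread is interrupted).
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime wraps around
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true roughly 300 ms after start
        long start = System.currentTimeMillis();
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2_000, 50);
        System.out.println("condition met: " + met);
    }
}
```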

Please help me.

P.S. This is my HtmlUnit Maven dependency.

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.20</version>
    </dependency>
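Another option might be to skip the page's JavaScript for the comments and request the comment API seen in the Chrome network trace directly. The following is a minimal sketch using only the Java standard library, not a confirmed solution. The endpoint shape and the post id 17542278 are copied from the traced URL; the class and method names are hypothetical; and I am assuming that `offset` and `limit` page through the results, that the response is text such as JSON, and that the server accepts the request without extra headers or cookies. Any of these assumptions may be wrong.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class CommentApiSketch {

    // Endpoint prefix copied from the request seen in Chrome's network trace
    private static final String BASE = "http://comment.daum.net/apis/v1/posts/";

    /** Builds the comment-list URL with the same query parameters as the traced request. */
    static String buildUrl(long postId, int parentId, int offset, int limit, String sort) {
        return BASE + postId + "/comments"
                + "?parentId=" + parentId
                + "&offset=" + offset
                + "&limit=" + limit
                + "&sort=" + sort;
    }

    /** Fetches the raw response body as text (assumed, not confirmed, to be JSON). */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Print the URLs for the first three pages of 10 comments (offset paging is an assumption)
        for (int page = 0; page < 3; page++) {
            System.out.println(buildUrl(17542278L, 0, page * 10, 10, "RECOMMEND"));
        }
        // Only touch the network when explicitly asked: java CommentApiSketch fetch
        if (args.length > 0 && "fetch".equals(args[0])) {
            try {
                System.out.println(fetch(buildUrl(17542278L, 0, 0, 10, "RECOMMEND")));
            } catch (IOException e) {
                System.err.println("Request failed: " + e.getMessage());
            }
        }
    }
}
```

How the post id 17542278 is derived from the article URL is not something I know; it would still have to be found in the page or in another traced request.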

0 answers