Question

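I am trying to download all the comments for a given news article (on www.theguardian.com). I can fetch the article in Java and parse it with Jsoup to get the URL of the comments, but when I try to download that URL I only get the default page with the default number of comments (50). For example, the comments URL might be http://discussion.theguardian.com/discussion/p/2nzaq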

If I load this page in Firefox, log in with my user ID and choose to show all comments, the URL changes to .../p/2nzaq#show-all

But when given this URL, Java still downloads only the same default 50 comments that I get from .../p/2nzaq?orderby=newest&per_page=50&commentpage=1

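I then thought of trying wget or aria2 from the command prompt (Windows), or running them as shell commands from Java, to fetch the comments at any of these URLs, but I still get the same default page and number of comments. Firefox seems to have no problem displaying and downloading all the comments. How can I automate this in Java? Thanks.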

Following the comments, I tried HttpClient:

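import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class DownloadFile {

    // Fetches the given URL with Apache HttpClient 4.x and writes the response body to filepath.
    public static void getFile(String url, String filepath) throws ClientProtocolException, IOException {
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet(url);
        HttpResponse response = httpClient.execute(httpget);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Call getContent() only once; the entity's stream may not be repeatable.
            // try-with-resources closes both streams even if the copy fails.
            try (InputStream bis = new BufferedInputStream(entity.getContent());
                 BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(new File(filepath)))) {
                int inByte;
                while ((inByte = bis.read()) != -1) {
                    bos.write(inByte);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int page = 3;
        String myUrl = "http://discussion.theguardian.com/discussion/p/2nzaq?orderby=newest&per_page=50&commentpage=" + page;
        String myFilePath = "./testfile" + page + ".htm";
        getFile(myUrl, myFilePath);
    }
}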

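I also tried URLs like ".../p/2nzaq#show-all". I did find that the HttpClient tutorial is wrong: you cannot instantiate HttpClient httpClient = new HttpClient(); which fails with "HttpClient is abstract; cannot be instantiated". I found in another post here that HttpClient httpClient = new DefaultHttpClient(); works.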

Answer 1

I believe you need a browser to do this. You can use Selenium to control a browser from Java. Setting it up is simple and takes a few minutes; see my answer: Running a WebDriver Test without using ANT, Maven, JUnit or Eclipse.
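One likely reason a plain HTTP client keeps returning the default page: the "#show-all" part of the URL is a fragment, and fragments are normally handled inside the browser rather than sent to the server. The sketch below only illustrates that split with java.net.URI; the class and method names (FragmentDemo, requestTarget, fragmentOf) are made up for this example, and it does not show what the Guardian page actually does with the fragment.

```java
import java.net.URI;

class FragmentDemo {

    // Rebuilds the URL from scheme, authority, path and query only.
    // This is the part an HTTP client would normally request; the fragment is left out.
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath()
                + (query == null ? "" : "?" + query);
    }

    // Returns the fragment (the text after '#'), or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String showAll = "http://discussion.theguardian.com/discussion/p/2nzaq#show-all";
        System.out.println(requestTarget(showAll)); // same URL as without "#show-all"
        System.out.println(fragmentOf(showAll));    // show-all
    }
}
```

If that is what is happening here, wget, aria2 and HttpClient would all end up requesting the same resource for .../p/2nzaq and .../p/2nzaq#show-all, which would match the behaviour described in the question.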

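Once you open that URL in Selenium, you get all the comments on the current page; then programmatically click the Next button and loop until you reach the last page.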
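The page-by-page loop can be sketched independently of how each page is fetched. This is a minimal sketch, assuming the commentpage and per_page query parameters from the question's URL select the page and that the total number of comments is known (for example, parsed from the article page). The class name CommentPageUrls, its methods, and the total of 120 comments are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CommentPageUrls {

    // Number of pages needed to show totalComments at perPage each (ceiling division).
    static int pageCount(int totalComments, int perPage) {
        if (totalComments < 0 || perPage <= 0) {
            throw new IllegalArgumentException("totalComments must be >= 0 and perPage > 0");
        }
        return (totalComments + perPage - 1) / perPage;
    }

    // Builds one URL per comment page, using the query format from the question.
    static List<String> pageUrls(String discussionUrl, int totalComments, int perPage) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(totalComments, perPage);
        for (int page = 1; page <= pages; page++) {
            urls.add(discussionUrl + "?orderby=newest&per_page=" + perPage + "&commentpage=" + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        String base = "http://discussion.theguardian.com/discussion/p/2nzaq";
        // 120 is a made-up comment count used only for illustration.
        for (String url : pageUrls(base, 120, 50)) {
            System.out.println(url);
        }
    }
}
```

Each URL in the list could then be opened in the Selenium-controlled browser, or passed to a method like getFile from the question if the server turns out to honour commentpage for plain HTTP requests.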
