Scraping web pages that are loaded by a JavaScript function

Time: 2017-09-13 13:16:44

Tags: java web-scraping jsoup html-parsing

I'm currently trying to scrape this site using jsoup.

My code so far:

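import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the request failed
        }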

        Elements list = doc.getElementsByClass("name showframe");

        for (int i = 0; i < list.size() ; i++) {
            System.out.println(list.get(i).html() + " \n" + list.get(i).absUrl("href"));
        }
    }
}

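My problem is that the code above only scrapes the first of the 71 pages; the other pages are loaded by calling a JavaScript function.

How can I scrape the other pages using jsoup?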

2 Answers:

Answer 0: (score: 0)

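The JavaScript function in question simply sends a POST request to the same URL, with __EVENTARGUMENT set to the number of the page.
You can easily fetch the other pages by mimicking this behavior:

Imports: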

import org.jsoup.*;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import static java.net.URLEncoder.encode;    

Code:

public static void main(String[] args){
    String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
    String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0";
    try {
        Response  response = Jsoup.connect(url).execute();
        Document document = response.parse();

        String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
        String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem", "UTF-8");

        for(int i = 1; i < 72; ++i) {
            document = Jsoup.connect(url).userAgent(userAgent)
                .requestBody(
                        String.format(
                                "__EVENTTARGET=%s"
                                + "&__EVENTARGUMENT=%d"
                                + "&__VIEWSTATE=%s",
                                eventTarget, i, viewState ))
                .cookies(response.cookies())
                .post();

            Elements list = document.getElementsByClass("name showframe");

            for (int x = 0; x < list.size() ; x++) {
                System.out.println(list.get(x).html() + " \n" + list.get(x).absUrl("href"));
            }
        }
    } catch (Exception ex) {
        // TODO Handle exceptions
        ex.printStackTrace();
    }
}
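
The request body that the loop above assembles with String.format can also be factored into a small helper, which makes it easy to check the encoding without touching the network. The following is only a minimal sketch: the class name PostbackForm and its methods enc and body are hypothetical, not part of jsoup or of the code above, and it assumes the server needs nothing beyond the three fields already used (__EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of jsoup or the original answer):
// builds the form-encoded body of the paging postback.
final class PostbackForm {

    private PostbackForm() {}

    // URL-encodes a single value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    // rawEventTarget and rawViewState are the unencoded values;
    // page is the page number sent as __EVENTARGUMENT.
    static String body(String rawEventTarget, int page, String rawViewState) {
        return "__EVENTTARGET=" + enc(rawEventTarget)
                + "&__EVENTARGUMENT=" + page
                + "&__VIEWSTATE=" + enc(rawViewState);
    }

    public static void main(String[] args) {
        String target = "p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem";
        // "dummy+view/state=" is a placeholder, not a real __VIEWSTATE value.
        System.out.println(body(target, 2, "dummy+view/state="));
    }
}
```

In the loop above, the result of PostbackForm.body(rawTarget, i, rawViewState) could be passed to requestBody(...) in place of the String.format call, provided the values handed to it are the raw, not yet encoded, strings.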

Answer 1: (score: 0)

So, I finally got this working...

dict