I am currently trying to scrape this site using jsoup.
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class Main {
        public static void main(String[] args) {
            Document doc;
            try {
                doc = Jsoup.connect("http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx").get();
            } catch (IOException e) {
                e.printStackTrace();
                return; // nothing to parse if the request failed
            }
            Elements list = doc.getElementsByClass("name showframe");
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i).html() + " \n" + list.get(i).absUrl("href"));
            }
        }
    }
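My problem is that the code above only scrapes the first of the 71 pages; the remaining pages are loaded by calling a JavaScript function.
How can I scrape the other pages with jsoup?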
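Answer 0 (score: 0)
The JavaScript function in question simply sends POST requests to the same URL with the parameter __EVENTARGUMENT, which is the number of the page.
You can easily fetch the other pages by mimicking this behaviour: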
    import org.jsoup.*;
    import org.jsoup.Connection.Response;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import static java.net.URLEncoder.encode;

    public class Main {
        public static void main(String[] args) {
            String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
            String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0";
            try {
                // Initial GET: provides the session cookies and the hidden __VIEWSTATE field
                Response response = Jsoup.connect(url).execute();
                Document document = response.parse();
                String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
                String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem", "UTF-8");
                // Pages 1 through 71: the page number is sent as __EVENTARGUMENT
                for (int i = 1; i < 72; ++i) {
                    document = Jsoup.connect(url).userAgent(userAgent)
                            .requestBody(
                                    String.format(
                                            "__EVENTTARGET=%s"
                                            + "&__EVENTARGUMENT=%d"
                                            + "&__VIEWSTATE=%s",
                                            eventTarget, i, viewState))
                            .cookies(response.cookies())
                            .post();
                    Elements list = document.getElementsByClass("name showframe");
                    for (int x = 0; x < list.size(); x++) {
                        System.out.println(list.get(x).html() + " \n" + list.get(x).absUrl("href"));
                    }
                }
            } catch (Exception ex) {
                // TODO Handle exceptions
                ex.printStackTrace();
            }
        }
    }
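The request body assembled inline above can also be built by a small helper, which keeps the URL-encoding in one place. The following is only a sketch using the standard library: the class and method names (`PostBackBody`, `formEncode`, `forPage`) are hypothetical, and the shortened event-target and view-state values in `main` are placeholders, not values taken from the site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of jsoup): builds the form-encoded
// postback body that the pager's JavaScript function would send.
public class PostBackBody {

    // URL-encodes a single key or value as UTF-8.
    public static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }

    // Joins the fields as key=value pairs separated by '&', in insertion order.
    public static String formEncode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(encode(field.getKey()) + "=" + encode(field.getValue()));
        }
        return body.toString();
    }

    // Builds the body for one page; all arguments are passed in raw (unencoded).
    public static String forPage(String eventTarget, int page, String viewState) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", eventTarget);
        fields.put("__EVENTARGUMENT", String.valueOf(page));
        fields.put("__VIEWSTATE", viewState);
        return formEncode(fields);
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only
        System.out.println(forPage("p$lt$ctl12$pager", 2, "abc+/="));
    }
}
```

With such a helper, the loop could pass `PostBackBody.forPage(rawEventTarget, i, rawViewState)` to `.requestBody(...)`, supplying the raw values, since the helper does the encoding itself.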
Answer 1 (score: 0)
So, I finally got this working...