I am currently scraping links so that I can visit each individual item on this site:
https://southwesthumane.org/adopt/dogs/
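However, the listing at this URL is split across multiple pages that are switched via JavaScript. The page source for the pager looks like this: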
<span id="ContentPlaceHolder_Item3_AdoptionDogs_2_dpDogs"><a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl00$ctl00','')">Previous</a> <a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl01$ctl00','')">1</a> <span>2</span> <a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl01$ctl02','')">3</a> <a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl01$ctl03','')">4</a> <a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl01$ctl04','')">5</a> <a href="javascript:__doPostBack('ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl02$ctl00','')">Next</a> </span>
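For reference, each pager link above calls `__doPostBack(eventTarget, eventArgument)`. In ASP.NET WebForms this fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submits the page's form, so the first argument identifies which page is being requested. Below is a minimal sketch of pulling those targets out of the markup with plain `java.util.regex`. The class and method names are made up for illustration, and the sample string in `main` uses shortened targets rather than the real ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or of the site's own code.
public class PostBackLinks {

    // Matches <a href="javascript:__doPostBack('TARGET','ARGUMENT')">LABEL</a>
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+href=\"javascript:__doPostBack\\('([^']*)','([^']*)'\\)\">([^<]*)</a>");

    /** Returns link label -> event target, in document order. */
    public static Map<String, String> targetsByLabel(String html) {
        Map<String, String> targets = new LinkedHashMap<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            targets.put(m.group(3).trim(), m.group(1));
        }
        return targets;
    }

    public static void main(String[] args) {
        // Shortened sample markup; the current page ("2") is a plain <span>, not a link.
        String pager = "<span><a href=\"javascript:__doPostBack('ctl00$dpDogs$ctl01$ctl00','')\">1</a> "
            + "<span>2</span> "
            + "<a href=\"javascript:__doPostBack('ctl00$dpDogs$ctl02$ctl00','')\">Next</a></span>";
        targetsByLabel(pager).forEach((label, target) -> System.out.println(label + " -> " + target));
    }
}
```

The current page is rendered as a bare `<span>` without a post-back link, so it does not show up in the result.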
Right now I only scrape data from the first page, and I don't know how to reach the remaining pages to scrape them as well.
Here is my code so far:
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.select.Elements;

    // Class name is illustrative; the original snippet showed only main().
    public class DogScraper {
        public static void main(String[] args) {
            try {
                // Only the first listing page is fetched here.
                Document dogs = Jsoup.connect("https://southwesthumane.org/adopt/dogs/").get();
                Elements links_dogs = dogs.select(":containsOwn(Details »)");
                //***********************DOGS*****************************
                for (Element link : links_dogs) {
                    String url = "https://southwesthumane.org" + link.attr("href");
                    System.out.println("\nurl: " + url);
                    try {
                        int index = 0;
                        Document dog = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
                        Elements name = dog.select("h3");
                        Elements description = dog.select("div.Animaldetails");
                        Elements details = dog.select("div.AnimalDetails > strong");
                        // Escape the dot so it matches a literal "." before the extension.
                        Elements img = dog.select("img[src~=\\.(jpg|jpeg)]");
                        // Print every second <h3> (odd indices) as the name.
                        for (Element code : name) {
                            if (index % 2 == 1)
                                System.out.println("Name: " + code.text());
                            index++;
                        }
                        for (Element code : img) {
                            System.out.println("Image: " + code.attr("src"));
                        }
                        for (Element code : description) {
                            System.out.println("Description: " + code.select("p").text());
                        }
                        // Each <strong> label is followed by its value as a sibling node.
                        for (Element code : details) {
                            Node value = code.nextSibling();
                            System.out.println(code.text() + " " + (value == null ? "" : value.toString()));
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
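For example, there are currently 5 pages, but I only reach the first one; I would like to visit the rest of the available pages too.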
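One possible direction, given that the pager links call `__doPostBack`: a WebForms post-back is an ordinary form POST in which `__EVENTTARGET` and `__EVENTARGUMENT` carry the two arguments of the call, alongside the page's existing hidden fields. The sketch below only builds that field map. It assumes the page exposes hidden fields such as `__VIEWSTATE` (typical for WebForms, but not verified for this site), and `PostBackForm` and `build` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative only.
public class PostBackForm {

    /**
     * Copies the form's current fields (e.g. __VIEWSTATE) and sets the two
     * fields that __doPostBack would fill in before submitting.
     */
    public static Map<String, String> build(Map<String, String> currentFields,
                                            String eventTarget, String eventArgument) {
        Map<String, String> form = new LinkedHashMap<>(currentFields);
        form.put("__EVENTTARGET", eventTarget);
        form.put("__EVENTARGUMENT", eventArgument);
        return form;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Placeholder: the real value would be read from the fetched page (assumption).
        fields.put("__VIEWSTATE", "placeholder-value");
        // Target copied from the "Next" link in the pager markup.
        Map<String, String> form = build(fields,
            "ctl00$ctl00$ContentPlaceHolder$Item3$AdoptionDogs_2$dpDogs$ctl02$ctl00", "");
        form.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With jsoup, a map like this could be passed to `Jsoup.connect(url).data(form).post()`, with the current fields read from the page's hidden `<input>` elements. Whether this site accepts such a request as-is (cookies, additional validation fields) is untested here.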