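I am trying to use Jsoup to select the links in the <a> tags inside the <li> elements inside <div class="paging"> on the page linked here.
This is the HTML code (ajax) of that page: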
$.ajax({
    url: urlnext,
    beforeSend: function (xhr) {
        $('#search-load').show();
    },
    success: function (result) {
        $('#search-content').html(result);
        $('.content-left').show();
        $('#search-load').hide();
        $('.paging li a').each(function () {
            var linkPage = $(this).attr('href');
            $(this).click(function (event) {
                event.preventDefault();
                location.href = linkPage;
            });
        });
    }
});
Here is my complete code:
1  import java.io.IOException;
2  import java.io.PrintWriter;
3  import java.util.regex.Matcher;
4  import java.util.regex.Pattern;
5
6  import org.jsoup.Connection;
7  import org.jsoup.HttpStatusException;
8  import org.jsoup.Jsoup;
9  import org.jsoup.nodes.Document;
10 import org.jsoup.nodes.Element;
11 import org.jsoup.select.Elements;
12
13 public class Main_sindo {
14
15     public static void main(String[] args) throws IOException {
16         int searchPageNumber = 1;
17         prosesCrawling_sindo("http://search.sindonews.com/search?type=artikel&q=bni", searchPageNumber); //bni
18     }
19
20     public static void prosesCrawling_sindo(String URL, int searchPageNumber){
21         try{
22             Connection.Response response = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).execute();
23             int statusCode = response.statusCode();
24             if(statusCode == 200){
25                 Document dok = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).get();
26                 Element newsTitle = dok.select("div.article h1").first();
27                 Element newsContent = dok.select("div#content").first();
28                 if(newsContent != null){
29                     // save the article text to a file named after the page title
30                     String fileName = dok.title().replaceAll("[\\/\"?:*<>|]+", " ");
31                     PrintWriter writer = new PrintWriter(fileName+".txt");
32                     writer.println(newsContent.text());
33                     writer.close();
34                 }
35
36                 //get all links and recursively call the processPage method
37                 Elements newsPages = dok.select("div.news-content a");
38                 for(Element newsPage: newsPages){
39                     if(newsPage.attr("href").contains("sindonews.com")){
40                         prosesCrawling_sindo(newsPage.attr("abs:href"), searchPageNumber);
41                         System.out.println(newsPage.attr("abs:href"));
42                     }
43                 }
44
45                 //access next search page
46                 Elements nextPages = dok.select("div.paging > li > a"); //here is the problem. It seems Jsoup cannot select div class=paging (empty)
47                 for(Element nextPage: nextPages){
48                     if(nextPage != null && Integer.parseInt(nextPage.text())==searchPageNumber){
49                         if(nextPage.attr("href").contains("sindonews.com")){
50                             prosesCrawling_sindo(nextPage.attr("abs:href"), searchPageNumber);
51                         }
52                         searchPageNumber+=1;
53                     }
54                 }
55             }
56         }catch (NullPointerException e) {
57             // TODO Auto-generated catch block
58             e.printStackTrace();
59         } catch (HttpStatusException e) {
60             e.printStackTrace();
61         } catch (IOException e) {
62             // TODO Auto-generated catch block
63             e.printStackTrace();
64         }
65     }
66 }
I guess the problem is at line 46.
Thanks for your help :)
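Answer 0 (score: 0)
You should change the if conditions. First check whether paging is null, because the method org.jsoup.select.Elements#first can return null (you can easily verify this, since jSoup is an open-source library). If you try to call a method on a null reference, you will definitely get an NPE. So you should do the checks in the reverse order: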
// check for null first, so isEmpty() is never called on a null reference
if(paging == null){
    System.out.println("this is null");
} else if(paging.isEmpty()){
    System.out.println("this is empty");
}
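And yes, this happens because jSoup cannot find any element that matches your selector. Please check the jSoup documentation and make sure that you are using the correct selector format and that your document actually contains the specified element.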
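The check order described above can be sketched without Jsoup on the classpath. This is only an illustration under assumptions: the class PagingCheck and the helper describe are hypothetical names, and a plain java.util.List stands in for whatever the selector returned.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PagingCheck {

    // Hypothetical helper: reports the state of a selection result that may be null.
    // The null test comes first, so isEmpty() is never called on a null reference.
    static String describe(List<String> paging) {
        if (paging == null) {
            return "null";
        } else if (paging.isEmpty()) {
            return "empty";
        }
        return paging.size() + " link(s)";
    }

    public static void main(String[] args) {
        // Stand-in values only; no real page is fetched here.
        System.out.println(describe(null));
        System.out.println(describe(Collections.<String>emptyList()));
        System.out.println(describe(Arrays.asList("/search?page=2", "/search?page=3")));
    }
}
```

Using else if rather than two separate if statements matters here: with two independent checks, the second one would still run on a null reference and throw the NPE.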