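Jsoup select on a div tag returns empty. Is it because the tag does not exist in the page source?

Time: 2016-02-17 17:53:35

Tags: java html jsoup

I am trying to use Jsoup to select the links in the <a> tags inside the <li> elements inside <div class="paging"> on the search results page.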

This is the code from the page above (the ajax part):

 $.ajax({
     url: urlnext,
     beforeSend: function (xhr) {
         $('#search-load').show();
     },
     success: function (result) {
         $('#search-content').html(result);
         $('.content-left').show();
         $('#search-load').hide();
         $('.paging li a').each(function () {
             var linkPage = $(this).attr('href');
             $(this).click(function (event) {
                 event.preventDefault();
                 location.href = linkPage;
             });
         });
     }
 });

Here is my full code:

1   import java.io.IOException;
2   import java.io.PrintWriter;
3   import java.util.regex.Matcher;
4   import java.util.regex.Pattern;
5
6   import org.jsoup.Connection;
7   import org.jsoup.HttpStatusException;
8   import org.jsoup.Jsoup;
9   import org.jsoup.nodes.Document;
10  import org.jsoup.nodes.Element;
11  import org.jsoup.select.Elements;
12
13  public class Main_sindo {
14 
15      public static void main(String[] args) throws IOException {
16          int searchPageNumber = 1;
17          prosesCrawling_sindo("http://search.sindonews.com/search?type=artikel&q=bni", searchPageNumber); //bni
18      }
19
20      public static void prosesCrawling_sindo(String URL, int searchPageNumber){
21          try{
22              Connection.Response response = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).execute();
23              int statusCode = response.statusCode();
24              if(statusCode == 200){
25                  Document dok = Jsoup.connect(URL).userAgent("Mozilla/5.0").timeout(0).get();
26                  Element newsTitle = dok.select("div.article h1").first();
27                  Element newsContent = dok.select("div#content").first();
28                  if(newsContent != null){
29                      
30                      String fileName = dok.title().replaceAll("[\\/\"?:*<>|]+", " ");
31                      PrintWriter writer = new PrintWriter(fileName+".txt");
32                      writer.println(newsContent.text());
33                      writer.close();
34                  }
35              
36                  //get all links and recursively call the processPage method
37                  Elements newsPages = dok.select("div.news-content a");
38                  for(Element newsPage: newsPages){
39                      if(newsPage.attr("href").contains("sindonews.com")){
40                          prosesCrawling_sindo(newsPage.attr("abs:href"), searchPageNumber);
41                          System.out.println(newsPage.attr("abs:href"));
42                      }
43                  }
44                  
45                  //access next search page
46                  Elements nextPages = dok.select("div.paging > li > a"); //here is the problem. It seems Jsoup cannot select div class=paging (the result is empty)
47                  for(Element nextPage: nextPages){
48                      if(nextPage != null && Integer.parseInt(nextPage.text())==searchPageNumber){
49                          if(nextPage.attr("href").contains("sindonews.com")){
50                              prosesCrawling_sindo(nextPage.attr("abs:href"), searchPageNumber);
51                          }
52                          searchPageNumber+=1;
53                      }
54                  }
55              }
56          }catch (NullPointerException e) {
57              // TODO Auto-generated catch block
58              e.printStackTrace();
59          } catch (HttpStatusException e) {
60              e.printStackTrace();
61          } catch (IOException e) {
62              // TODO Auto-generated catch block
63              e.printStackTrace();
64          }
65      }
66  }

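I guess the problem is on line 46.

Thanks for your help :)

1 answer:

Answer 0 (score: 0)

You should change your if conditions: check paging for null first, because the method org.jsoup.select.Elements#first can return null (you can easily verify this, since jSoup is an open-source library).

If you try to call a method on a null reference, you will definitely get an NPE. So you should do the checks in the opposite order: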

if(paging == null){
    System.out.println("this is null");
} else if(paging.isEmpty()){
    // only reached when paging is not null, so this call cannot throw an NPE
    System.out.println("this is empty");
}
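The ordering above can be tried without the library. This is only a minimal sketch: a plain List<String> stands in for org.jsoup.select.Elements, and the names NullBeforeEmptyCheck, firstOrNull and describe are hypothetical, not part of jSoup. The helper firstOrNull is assumed to mirror Elements#first by returning null when nothing matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NullBeforeEmptyCheck {

    // Hypothetical stand-in for Elements#first: null when the list has no matches.
    static String firstOrNull(List<String> matches) {
        return matches.isEmpty() ? null : matches.get(0);
    }

    // Tests for null first, then for emptiness, so no method is called on a null reference.
    static String describe(String paging) {
        if (paging == null) {
            return "this is null";
        } else if (paging.isEmpty()) {
            return "this is empty";
        }
        return "found: " + paging;
    }

    public static void main(String[] args) {
        List<String> noMatches = new ArrayList<String>();
        List<String> oneMatch = Arrays.asList("<div class=\"paging\">...</div>");
        System.out.println(describe(firstOrNull(noMatches)));
        System.out.println(describe(firstOrNull(oneMatch)));
    }
}
```

With the checks the other way round, the first call in main would throw a NullPointerException instead of reporting "this is null".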

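Yes, this is because jSoup cannot find any element that matches your selector. Check the jSoup documentation to make sure you are using the correct selector format and that your document actually contains the specified element.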
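One rough way to test whether the document contains the element is to search the raw HTML for the class name before blaming the selector. jSoup parses the HTML it downloads and does not run JavaScript, so markup that the page's $.ajax call injects into #search-content would not be in the fetched document. The sketch below is an illustration under assumptions: the two HTML strings are made-up stand-ins (in the crawler the string could come from response.body()), and PagingSourceCheck and hasClass are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PagingSourceCheck {

    // Captures the value of every class="..." or class='...' attribute.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("class\\s*=\\s*[\"']([^\"']*)[\"']");

    // True if any class attribute in the raw HTML lists className as a whole token.
    static boolean hasClass(String rawHtml, String className) {
        Matcher m = CLASS_ATTR.matcher(rawHtml);
        while (m.find()) {
            for (String token : m.group(1).trim().split("\\s+")) {
                if (token.equals(className)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed shape of the static source: an empty container plus the script that fills it.
        String staticSource = "<div id=\"search-content\"></div>"
                + "<script>$('.paging li a').each(function () {});</script>";
        // Assumed shape of the markup after the browser has run the ajax call.
        String afterAjax = "<div id=\"search-content\"><div class=\"paging\">"
                + "<li><a href=\"#\">2</a></li></div></div>";
        System.out.println(hasClass(staticSource, "paging"));
        System.out.println(hasClass(afterAjax, "paging"));
    }
}
```

If the class is missing from the raw source, no selector will match it, and the pagination markup would have to be fetched from whatever URL the page passes to $.ajax (urlnext, whose value is not shown in the question). The regular expression is only a quick check and can be fooled by text inside scripts or comments.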