Jsoup only crawls a limited number of links

Time: 2015-12-20 11:29:20

Tags: jsoup

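I am new to jsoup and need to use it, but I have run into a problem: only a limited number of links get crawled. I am crawling http://shais.net/ and see only 35 absolute URLs, although the site has at least 430 links. Here is my code: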

    public static void main(String[] args) throws SQLException, IOException {
        PreparedStatement statement = db.Connection.connection.prepareStatement("truncate record;");
        statement.execute();

        processPage("http://shais.net/"); // TODO
    }

    public static void processPage(String URL) throws SQLException, IOException {
        String sql = "select * from Record where URL = '" + URL + "'";
        PreparedStatement select = db.Connection.connection.prepareStatement(sql);
        ResultSet result = select.executeQuery();
        if (result.next()) {

        } else {
            sql = "insert into record (URL) values('" + URL + "')";
            PreparedStatement statement = db.Connection.connection.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            statement.execute();

            org.jsoup.nodes.Document doc = Jsoup.connect("http://shais.net/") // TODO
                    .header("Accept-Encoding", "gzip, deflate")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                    .maxBodySize(0)
                    .timeout(600000)
                    .get();

            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                if (link.attr("href").contains("shais.net"))
                    processPage(link.attr("abs:href"));
                System.out.println(link.attr("abs:href"));
            }
        }
    }

Please help me find where the problem is.
Thanks.
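
Two details in the code above may explain the low count, though this has not been checked against the live site. First, `processPage` always calls `Jsoup.connect("http://shais.net/")` instead of connecting to its `URL` argument, so every call parses the home page again. Second, the test `link.attr("href").contains("shais.net")` looks at the raw attribute, so relative links such as `/page.html` never pass it, and because the `if` has no braces it guards only the `processPage` call. The sketch below illustrates the idea using only the JDK so that it runs without a network. `LinkFilter`, `resolve`, `sameSite`, `crawl`, and the in-memory `site` map are hypothetical names invented for the illustration, and `fetchHrefs` stands in for fetching a page with jsoup and reading the `href` of each `a[href]` element.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of jsoup: decides which links a crawler follows.
class LinkFilter {

    // Resolve a raw href against the page URL (similar in spirit to jsoup's "abs:href")
    // and drop any #fragment so the same page is not queued twice.
    static String resolve(String pageUrl, String href) {
        try {
            String abs = URI.create(pageUrl).resolve(href.trim()).toString();
            int hash = abs.indexOf('#');
            return hash >= 0 ? abs.substring(0, hash) : abs;
        } catch (IllegalArgumentException e) {
            return ""; // malformed href: the caller will reject it
        }
    }

    // True when the resolved URL is http(s) and its host equals the given host.
    // Assumption: an exact host match is wanted, so "www.shais.net" would not match "shais.net".
    static boolean sameSite(String absUrl, String host) {
        try {
            URI uri = URI.create(absUrl);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && uri.getHost().equalsIgnoreCase(host)
                    && ("http".equals(scheme) || "https".equals(scheme));
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Breadth-first crawl with a visited set instead of unbounded recursion.
    // fetchHrefs is a stand-in for "download this URL and return its raw hrefs".
    static Set<String> crawl(String start, String host,
                             Function<String, List<String>> fetchHrefs) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            // Fetch the URL taken from the queue, not a fixed start page.
            for (String href : fetchHrefs.apply(url)) {
                String abs = resolve(url, href);
                if (sameSite(abs, host) && !visited.contains(abs)) {
                    queue.add(abs);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up pages and links, used only to exercise the logic offline.
        Map<String, List<String>> site = Map.of(
                "http://shais.net/", List.of("/a.html", "b.html", "http://example.com/x", "mailto:me@shais.net"),
                "http://shais.net/a.html", List.of("c.html", "/", "#top"),
                "http://shais.net/b.html", List.of(),
                "http://shais.net/c.html", List.of("a.html"));
        Set<String> found = crawl("http://shais.net/", "shais.net",
                url -> site.getOrDefault(url, List.of()));
        found.forEach(System.out::println);
    }
}
```

With jsoup, `link.attr("abs:href")` already returns the resolved URL, so the host check can be applied to that value rather than to the raw `href`. Binding the URL through `PreparedStatement.setString` with a `?` placeholder, instead of concatenating it into the SQL string, would also keep the query from breaking when a URL contains a quote character.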

0 answers:

No answers