迭代网站上的字母标签,使用jsoup从字母顺序页面中删除数据

时间:2018-03-07 13:51:55

标签: java web-scraping jsoup

迭代下面代码中给出的网站字母表页面的最佳方法是什么?我应该使用关键字字符串并迭代它并将关键字附加到网址,还是应该从第一页上的字母按钮工具栏中提取网址?我在下面的代码中尝试了两种情况并获得空指针异常。请帮忙。

    public static void main(String[] args) throws Exception {

            String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
            String keyword = "a";
            //for (String keyword : keywords){
                String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha="+keyword; 
                Document doc = Jsoup.connect(url).get();
                Element alphabetList = doc.select("div.class.btn-group.btn-group-sm").first();
                Elements alphabets = alphabetList.select("a[href]");
                for (Element alphabet : alphabets){
                    System.out.println("link : " + alphabet.attr("href"));
                    System.out.println("text : " + alphabet.text());

                }
}

修改

我想到了如何正确地做到这一点。但是我得到一个读取超时错误,因为这可能不是刮掉这么多页面的最有效方法。建议欢迎提高代码效率。

public static void main(String[] args) throws Exception {
        Map<String,String> drugLinks = new LinkedHashMap<String,String>();
        final int OK = 200;
        String currentURL;
        int page = 1;
        int status = OK;
        Connection.Response response = null;
        Document doc = null;
        String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
        //String keyword = "a";
        for (String keyword : keywords){
            final String url = "https://www.medindia.net/drug-price/brand-index.asp?alpha="+keyword; 
            while (status == OK) {
                currentURL = url +"&page="+ String.valueOf(page); 
                response = Jsoup.connect(currentURL)
                        .userAgent("Mozilla/5.0")
                        .execute();
                status = response.statusCode();


                if (status == OK) {
                    doc = response.parse();

                    Element table = doc.select("table").get(1);

                    for (Element rows : table.select("tr")) {
                        for (Element tds : rows.select("td")) {
                            Elements links = tds.select("a[href]");
                            for (Element link : links) {
                                drugLinks.put(link.text(), link.attr("href"));
                                System.out.println("link : " + link.attr("href"));
                                System.out.println("text : " + link.text());
                            }
                        }
                    }

                }
                page++;
            }

        }
    }

0 个答案:

没有答案