Scraping a website with multiple alphabet tabs, each containing multiple pages

Time: 2018-03-06 16:50:17

Tags: java web-scraping jsoup

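I am scraping a website whose data is listed alphabetically under A-Z tabs, and each letter tab spans several pages. How can I extract all the URLs from it?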

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws Exception {

    String keyword = "a";
    String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;

    Document doc = Jsoup.connect(url).get();
    // Elements pages = doc.select("div.pagination a");
    // Take the second table on the page (index 1).
    Element table = doc.select("table").get(1);

    // Print the target and text of every link found in the table cells.
    for (Element row : table.select("tr")) {
        for (Element td : row.select("td")) {
            Elements links = td.select("a[href]");
            for (Element link : links) {
                System.out.println("link : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        }
    }
}
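
One way to think about the problem is to list every letter/page address first and only then fetch them. The following is a minimal sketch using only the JDK. The `alpha` parameter comes from the URL above; the `page` parameter name and the `maxPagesPerLetter` limit are assumptions, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BrandIndexUrls {

    // Base address taken from the question; only the query string varies.
    static final String BASE = "https://www.medindia.net/drug-price/brand-index.asp";

    // Builds the address of one page of one letter tab.
    // The "page" parameter name is an assumption about the site.
    static String pageUrl(char letter, int page) {
        return BASE + "?alpha=" + letter + "&page=" + page;
    }

    // Lists the addresses for letters a-z, pages 1..maxPagesPerLetter.
    // maxPagesPerLetter is a hypothetical upper bound, not a value from the site.
    static List<String> allUrls(int maxPagesPerLetter) {
        List<String> urls = new ArrayList<>();
        for (char letter = 'a'; letter <= 'z'; letter++) {
            for (int page = 1; page <= maxPagesPerLetter; page++) {
                urls.add(pageUrl(letter, page));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the candidate addresses for two pages per letter.
        for (String url : allUrls(2)) {
            System.out.println(url);
        }
    }
}
```

Each address in the list could then be passed to `Jsoup.connect(...)` in the same way as the single URL above. Since the real number of pages per letter is not known in advance, a fixed bound like this will request pages that may not exist.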

1 Answer:

Answer 0 (score: 0)

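I was able to work out how to scrape the data from every page of every letter tab; the code is below. However, after scraping a few hundred links I get a read timeout error. Is there a more efficient way to do this? Could I apply multithreading?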

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) throws Exception {

        final int OK = 200;

        for (char keyword = 'a'; keyword <= 'z'; keyword++) {
            final String url = "https://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;
            // Restart the page counter and status for every letter; otherwise
            // the inner loop never runs again once the first letter is done.
            int page = 1;
            int status = OK;
            // This loop ends only when the site answers a page with a non-200 status.
            while (status == OK) {
                String currentURL = url + "&page=" + page;
                Connection.Response response = Jsoup.connect(currentURL)
                        .userAgent("Mozilla/5.0")
                        // Return error statuses instead of throwing HttpStatusException.
                        .ignoreHttpErrors(true)
                        .execute();
                status = response.statusCode();

                if (status == OK) {
                    Document doc = response.parse();

                    // Take the second table on the page (index 1).
                    Element table = doc.select("table").get(1);

                    for (Element row : table.select("tr")) {
                        for (Element td : row.select("td")) {
                            Elements links = td.select("a[href]");
                            for (Element link : links) {
                                System.out.println("link : " + link.attr("href"));
                                System.out.println("text : " + link.text());
                            }
                        }
                    }
                }
                page++;
            }
        }
    }
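
On the timeout and multithreading questions, one possible approach is to retry a request that fails with an I/O error (a read timeout surfaces as `SocketTimeoutException`, which is an `IOException`) and to spread the requests over a small fixed thread pool. The following is only a sketch under those assumptions, written against the JDK alone so that it runs without the site. `PoliteCrawler`, `Fetcher`, `withRetry` and `fetchAll` are hypothetical names, and the stand-in fetcher in `main` does no network access.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoliteCrawler {

    // Hypothetical abstraction over "download one page"; with jsoup this
    // would wrap a Jsoup.connect(url) call.
    interface Fetcher<T> {
        T fetch(String url) throws IOException;
    }

    // Runs the task up to maxAttempts times. After each IOException it waits
    // baseDelayMillis * attempt before trying again; the last failure is rethrown.
    static <T> T withRetry(Callable<T> task, int maxAttempts, long baseDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(baseDelayMillis * attempt);
            }
        }
    }

    // Fetches every URL on a fixed pool of worker threads and returns the
    // results in the same order as the input list.
    static <T> List<T> fetchAll(List<String> urls, Fetcher<T> fetcher, int threads,
                                int maxAttempts, long baseDelayMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<T> job = () -> withRetry(() -> fetcher.fetch(url), maxAttempts, baseDelayMillis);
                futures.add(pool.submit(job));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that page is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher so the sketch runs offline.
        Fetcher<String> fake = url -> "fetched " + url;
        List<String> urls = Arrays.asList("page-1", "page-2", "page-3");
        System.out.println(fetchAll(urls, fake, 2, 3, 500L));
    }
}
```

With jsoup, the fetcher could be something like `url -> Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(30_000).get()`, where `timeout` raises the connect and read timeout in milliseconds. A small pool is a deliberate choice here: more threads mean more simultaneous requests against the same server, which may make timeouts more frequent rather than less.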