Web scraping from an iframe page

Time: 2016-09-08 07:27:32

Tags: iframe web-scraping

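I'm having a lot of trouble scraping some data from a page (one example is http://www.arena-offshore.com/crew-boats.html). We have permission to take the data, but they are "too busy" to give it to us in any form.

I have tried the Web Scraper plugin for Chrome and import.io, and I have started looking at more sophisticated programs, but they are a bit beyond me. For starters, none of the programs seems able to recognize the separate link for each vessel, so I can't even get as far as scraping the individual fields. So I figure that if someone knows how to grab the link for each vessel, I can work out the rest. Does anyone have an idea? I know my skills aren't the best, but hopefully someone can point me in the right direction.

Many thanks

Alex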

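1 answer:

Answer 0: (score: 0)

How about using Java with the HTML parser jsoup? Jsoup is a good tool for parsing websites (as long as they do not depend on JavaScript), and it lets you use CSS selectors to pick out specific HTML elements.

Java code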

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Scraper {

    List<Category> categories = new ArrayList<>();

    static class Category {
        private int categoryNumber;
        private String title;

        public Category(int categoryNumber, String title) {
            this.categoryNumber = categoryNumber;
            this.title = title;
        }

        public int getCategoryNumber() {
            return categoryNumber;
        }

        public String getTitle() {
            return title;
        }
    }

    public Scraper(){
        categories.add(new Category(1, "CREW BOATS"));
        categories.add(new Category(2, "TUG BOATS"));
        categories.add(new Category(3, "AHT & AHTS"));
        categories.add(new Category(4, "SUPPLY/UTILITY VESSELS"));
        categories.add(new Category(5, "BARGES"));
        categories.add(new Category(6, "MISCELLANEOUS"));
    }

    private void scrapeCategory(Category category){

        System.out.println("\n"+category.getTitle()+"\n");
        String searchUrl = "http://www.arena-offshore.com/iframe/list/index.php?category=" + category.getCategoryNumber() + "&page=";
        int pageIndex=1;
        Document doc;

        while (true) {

            try {
                doc = Jsoup.connect(searchUrl + pageIndex)
                        .userAgent(
                                "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                        .referrer("http://www.arena-offshore.com/").get();

                if (doc.select("#pu132").isEmpty()) { // no more results
                    break;
                }

                for (Element element : doc.select("#pu132")) {
                    String boat = element.select("[data-muse-uid=\"U159\"]").first().text(); //ID
                    boat += "\n\t\t" + element.select("a").first().attr("href"); //HREF
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U158\"]").get(0).text(); //TYPE
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U158\"]").get(1).text(); // LOCATION
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U153\"]").first().text(); // BRIEF DETAILS
                    System.out.println("\t"+boat);
                }

                pageIndex++;

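            } catch (IOException e) {
                // Stop on a network/HTTP error; without the break the loop
                // would retry the same page forever.
                e.printStackTrace();
                break;
            }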
        }
    }

    public void scrapeAllCategories(){
        for (Category category : categories) {
            scrapeCategory(category);
        }
    }

    public static void main(String[] args) {
        new Scraper().scrapeAllCategories();
    }

}

Note: you need to download the jsoup core library and add it to your build path.
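
The while (true) loop in scrapeCategory follows a common paging pattern: request page 1, 2, 3, ... and stop at the first page that yields no results. As a minimal sketch, that pattern can be isolated from the network code so it can be tried offline. The class name PagedCollector, the method collectAll, and the maxPages safety limit are assumptions made for illustration and are not part of the answer's code or of jsoup; the stub in main only stands in for a real page fetch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

public class PagedCollector {

    /**
     * Calls fetchPage for pages 1, 2, 3, ... and gathers the items returned.
     * Stops at the first empty (or null) page, or after maxPages pages,
     * whichever comes first. maxPages is a hypothetical safety limit.
     */
    public static <T> List<T> collectAll(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> items = fetchPage.apply(page);
            if (items == null || items.isEmpty()) {
                break; // no more results
            }
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stub in place of a real HTTP request: two pages with one
        // placeholder ID each, then an empty page.
        IntFunction<List<String>> stub = page -> {
            if (page == 1) return Collections.singletonList("AR-C1002");
            if (page == 2) return Collections.singletonList("AR-C1000");
            return Collections.emptyList();
        };
        System.out.println(collectAll(stub, 50)); // prints [AR-C1002, AR-C1000]
    }
}
```

In a real run, the function passed to collectAll would fetch and parse one list page, much as the body of the loop in scrapeCategory does.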

Output of the Scraper program:

CREW BOATS

    AR-C1002
        http://www.arena-offshore.com/agent-boat-for-sale-AR-C1002.html
        AGENT BOAT
        EAST MED.
        FOR SALE 14 X 4 X 1.9 (DEPTH)M, 2007 BUILT 630 BHP, 12 PERSONEL, IN EAST MED.

    ...

    AR-C1000
        http://www.arena-offshore.com/AR-KED.html?page=13
        CREW BOAT
        SOUTH AMERICA
        FOR SALE 17 X 5 X 2.18M, 2009 BUILT 1200 BHP, IN SOUTH AMERICA

TUG BOATS

    AR-KTK
        http://www.arena-offshore.com/single-screw-tug-boat-AR-KTK.html
        SINGLE SCREW
        TURKEY
        FOR SALE 1998 BUILT/ 2008 REBUILT 1000 HP / 16 TBP

    ...

AHT & AHTS

    AR-RZA
        http://www.arena-offshore.com/AR-RZA.html
        ANCHOR HANDLING TUG / TOWING
        AFRICA
        FOR SALE 36 X 10 X 4 M (MAX DRAFT) 4400 BHP / 58 TBP

    ...

SUPPLY/UTILITY VESSELS

    AR-U5001
        http://www.arena-offshore.com/survey-vessel-in-south-east-asia-AR-U5001.html
        SURVEY SUPPOT VESSEL
        SOUTH EAST ASIA
        FOR SALE 20 X 6 X 1.5 (DRAFT)M, 2012 BUILT, IRS CLASS 650 BHP, 50 M2 DECK SPACE

    ...

BARGES

    AR-KLM
        http://www.arena-offshore.com/AR-KLM.html
        ACCOMMODATION
        SHETLANDS
        FOR CHARTER 1993 BUILT MAJOR CONVERSION 2004 AND 2013

    ...

MISCELLANEOUS

    AR-SAA
        http://www.arena-offshore.com/AR-SAA.html
        SHALLOW DRAFT MPP / WORKBOAT
        -
        FOR CHARTER 2800 HP / 37.3 TBP CERTIFIED

    ...

Note: the output is shortened; each ... stands for further results that the program prints.
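
If the goal is to load the results into a spreadsheet rather than read them on the console, the five values printed per vessel could be written as CSV lines instead. The following is only a sketch under that assumption: the class BoatCsv, its methods, and the column names are hypothetical and not part of the answer above. Quoting matters here because the details field in the sample output contains commas.

```java
import java.util.ArrayList;
import java.util.List;

public class BoatCsv {

    /** Quotes a field if it contains a comma, a double quote or a line break; embedded quotes are doubled. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Joins already-scraped field values into one CSV line. */
    public static String toCsvLine(String... fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Hypothetical header row.
        System.out.println(toCsvLine("id", "url", "type", "location", "details"));
        // Values copied from the first record of the sample output.
        System.out.println(toCsvLine(
                "AR-C1002",
                "http://www.arena-offshore.com/agent-boat-for-sale-AR-C1002.html",
                "AGENT BOAT",
                "EAST MED.",
                "FOR SALE 14 X 4 X 1.9 (DEPTH)M, 2007 BUILT 630 BHP, 12 PERSONEL, IN EAST MED."));
    }
}
```

Redirecting the program's standard output to a file would then give something a spreadsheet application can open directly.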