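Jsoup: get a div element by class

Date: 2017-12-02 23:00:52

Tags: java html web-scraping web-crawler jsoup

I'm new to parsing with Jsoup, and I want to get a list of all the companies on this page: https://angel.co/companies?company_types[]=Startup As far as I can tell, the way to do this is to inspect the page and find the div tags that hold the data I need. However, when I call: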

Document doc = Jsoup.connect("https://angel.co/companies?company_types[]=Startup").get();
System.out.println(doc.html());

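First, I can't even find those div tags (the ones that should contain the list of companies) in the HTML printed to my console. Second, even if I could find them, how would I select a div element with a class name like this:

div class=" dc59 frw44 _a _jm"  

Please excuse my terminology; I'm not sure how to go about this.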

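1 Answer:

Answer 0: (score: 1)

The data is not embedded in the page; it is retrieved with subsequent API calls: a POST to https://angel.co/company_filters/search_data returns a list of ids and a token (hexdigest), and a GET to https://angel.co/companies/startups, sent with that token and those ids, returns the company data.

Repeat this for each page (each page needs a new token and a new list of ids). You can watch the process in the Network tab of Chrome's developer tools.

The first request (POST) returns JSON, whereas the second request (GET) returns HTML inside a property of a JSON object.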
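As a rough illustration of how the second request's query string could be assembled from the values returned by the first, here is a minimal sketch that uses only the JDK. The class and method names (`StartupQuery`, `build`, `enc`) and the sample values are hypothetical; the parameter names mirror the ones used in the code further down. Unlike that code, this sketch percent-encodes the values and the `ids[]` key and leaves no trailing `&`, on the assumption that the server decodes them in the usual way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper, not part of the original answer.
final class StartupQuery {

    // Percent-encodes one query component as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Joins the fields returned by the first (POST) request into a query string.
    static String build(String total, String sort, int page, boolean isNew,
                        List<Integer> ids, String hexdigest) {
        StringJoiner query = new StringJoiner("&");
        query.add("total=" + enc(total));
        query.add("sort=" + enc(sort));
        query.add("page=" + page);
        query.add("new=" + isNew);
        for (Integer id : ids) {
            query.add(enc("ids[]") + "=" + id); // one ids[] entry per company id
        }
        query.add("hexdigest=" + enc(hexdigest));
        return query.toString();
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        System.out.println(build("1000", "signal", 1, false, Arrays.asList(11, 22), "abc123"));
        // prints total=1000&sort=signal&page=1&new=false&ids%5B%5D=11&ids%5B%5D=22&hexdigest=abc123
    }
}
```

The resulting string would be appended after `https://angel.co/companies/startups?` for the GET request.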

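The following retrieves the company filter:

// Imports used by the snippets in this answer
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;

private static CompanyFilter getCompanyFilter(final String filter, final int page) throws IOException {

    // POST the filter; the response is JSON containing the ids and the token (hexdigest).
    String response = Jsoup.connect("https://angel.co/company_filters/search_data")
            .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("filter_data[company_types][]", filter) // key only; Jsoup adds the "=" itself
            .data("sort", "signal")
            .data("page", String.valueOf(page))
            .userAgent("Mozilla")
            .ignoreContentType(true) // the response is JSON, not HTML
            .method(Connection.Method.POST)
            .execute()
            .body(); // raw response body, without running it through the HTML parser

    return new Gson().fromJson(response, CompanyFilter.class);
}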

Then the following extracts the companies:

private static List<Company> getCompanies(final CompanyFilter companyFilter) throws IOException {

    List<Company> companies = new ArrayList<>();

    URLConnection urlConn = new URL("https://angel.co/companies/startups?" + companyFilter.buildRequest()).openConnection();
    urlConn.setRequestProperty("User-Agent", "Mozilla");
    urlConn.connect();

    // The response is JSON whose "html" property holds the markup for this page.
    HtmlContainer htmlObj;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), "UTF-8"))) {
        htmlObj = new Gson().fromJson(reader, HtmlContainer.class);
    }

    Element doc = Jsoup.parse(htmlObj.getHtml());
    Elements data = doc.select("div[data-_tn]");

    if (data.isEmpty()) {
        System.out.println("no data");
        return companies;
    }

    // Start at index 2: the first two matches are skipped (presumably they are not company rows).
    for (int i = 2; i < data.size(); i++) {
        Element row = data.get(i);
        Element link = row.select("a").first();
        Element pitch = row.select("div.pitch").first();
        if (link == null || pitch == null) {
            continue; // skip rows that lack the expected markup
        }
        companies.add(new Company(link.attr("title"), link.attr("href"), pitch.text()));
    }
    return companies;
}

The main function:

public static void main(String[] args) throws IOException {

    List<Company> companies = new ArrayList<>();

    // Fetch at most 10 pages; page numbers start at 1.
    for (int page = 1; page <= 10; page++) {

        System.out.println("get page n°" + page);
        CompanyFilter companyFilter = getCompanyFilter("Startup", page);
        System.out.println("digest     : " + companyFilter.getDigest());
        System.out.println("count      : " + companyFilter.getTotalCount());
        System.out.println("array size : " + companyFilter.getIds().size());
        System.out.println("page       : " + companyFilter.getpage());

        List<Company> batch = getCompanies(companyFilter);
        if (batch.isEmpty()) {
            break; // this page returned no companies, so stop paging
        }
        companies.addAll(batch);
        System.out.println("size     : " + companies.size());
    }
}

Company, CompanyFilter & HtmlContainer are the model classes:

class CompanyFilter {

    @SerializedName("ids")
    private List<Integer> mIds;

    @SerializedName("hexdigest")
    private String mDigest;

    @SerializedName("total")
    private String mTotalCount;

    @SerializedName("page")
    private int mPage;

    @SerializedName("sort")
    private String mSort;

    @SerializedName("new")
    private boolean mNew;

    public List<Integer> getIds() {
        return mIds;
    }

    public String getDigest() {
        return mDigest;
    }

    public String getTotalCount() {
        return mTotalCount;
    }

    public int getpage() {
        return mPage;
    }

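    // Builds the query string for the GET request from the fields of the POST response.
    // Not private, so that getCompanies can call it.
    String buildRequest() {
        StringBuilder out = new StringBuilder();
        out.append("total=").append(mTotalCount).append("&");
        out.append("sort=").append(mSort).append("&");
        out.append("page=").append(mPage).append("&");
        out.append("new=").append(mNew).append("&");
        for (int i = 0; i < mIds.size(); i++) {
            out.append("ids[]=").append(mIds.get(i)).append("&");
        }
        out.append("hexdigest=").append(mDigest).append("&");
        return out.toString();
    }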
}

private static class Company {

    private String mLink;
    private String mName;
    private String mDescription;

    public Company(String name, String link, String description) {
        mLink = link;
        mName = name;
        mDescription = description;
    }

    public String getLink() {
        return mLink;
    }

    public String getName() {
        return mName;
    }

    public String getDescription() {
        return mDescription;
    }
} 

private static class HtmlContainer {

    @SerializedName("html")
    private String mHtml;

    public String getHtml() {
        return mHtml;
    }
}

The complete code is also available here