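I'm new to parsing with Jsoup, and I want to get the list of all the companies on this page: https://angel.co/companies?company_types[]=Startup The way to do this would be to inspect the page for the div tags that hold the data I need. However, when I call: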
Document doc = Jsoup.connect("https://angel.co/companies?company_types[]=Startup").get();
System.out.println(doc.html());
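First, I can't even find those div tags (the ones that should hold the list of companies) in the HTML printed to my console. Second, even if I could find them, how would I select a div element with a class name like this: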
div class=" dc59 frw44 _a _jm"
Please excuse my wording; I don't know how to go about this.
Answer 0 (score: 1)
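The data is not embedded in the page. It is retrieved through successive API calls: first a POST to https://angel.co/company_filters/search_data, which returns an ids array & a token named hexdigest; then a GET to https://angel.co/companies/startups, which uses the output of the first request to fetch the company data.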
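This is repeated for every page, so each page needs a new token and a new list of ids. You can watch the process in the Network tab of the Chrome developer console.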
The first request (POST) returns JSON, but the second request (GET) returns HTML data inside a property of a JSON object.
The following retrieves the company filter:
// POSTs the filter and page number; the JSON reply carries the ids and the hexdigest token.
// Needs org.jsoup.Connection in addition to org.jsoup.Jsoup.
private static CompanyFilter getCompanyFilter(final String filter, final int page) throws IOException {
    String response = Jsoup.connect("https://angel.co/company_filters/search_data")
            .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("filter_data[company_types][]", filter)
            .data("sort", "signal")
            .data("page", String.valueOf(page))
            .userAgent("Mozilla")
            .ignoreContentType(true)
            .method(Connection.Method.POST)
            .execute()
            .body(); // raw response body, not run through the HTML parser
    return new Gson().fromJson(response, CompanyFilter.class);
}
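To see what the form fields passed to `.data(...)` look like on the wire, here is a minimal sketch that uses only the JDK. `FormBody` and `encode` are hypothetical names introduced for illustration, not part of Jsoup, and the sketch assumes the endpoint expects a standard `application/x-www-form-urlencoded` body, as the `Content-Type` header above suggests.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of Jsoup): builds an
// application/x-www-form-urlencoded body from key/value pairs.
final class FormBody {

    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Key and value are percent-encoded separately; the '=' between
            // them is a separator and does not belong to the key.
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("filter_data[company_types][]", "Startup");
        fields.put("sort", "signal");
        fields.put("page", "1");
        System.out.println(encode(fields));
    }
}
```

The square brackets in the field name come out as `%5B` and `%5D`, which is what the Network tab shows in the raw request body for bracketed parameter names.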
The following then extracts the companies:
// GETs the companies for one page; the JSON reply wraps an HTML fragment in its "html" property.
private static List<Company> getCompanies(final CompanyFilter companyFilter) throws IOException {
    List<Company> companies = new ArrayList<>();
    URLConnection urlConn = new URL("https://angel.co/companies/startups?" + companyFilter.buildRequest()).openConnection();
    urlConn.setRequestProperty("User-Agent", "Mozilla");
    urlConn.connect();
    HtmlContainer htmlObj;
    // try-with-resources closes the stream even if JSON parsing fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(urlConn.getInputStream(), StandardCharsets.UTF_8))) {
        htmlObj = new Gson().fromJson(reader, HtmlContainer.class);
    }
    Element doc = Jsoup.parse(htmlObj.getHtml());
    Elements data = doc.select("div[data-_tn]");
    if (data.isEmpty()) {
        System.out.println("no data");
        return companies;
    }
    // Index starts at 2: the first two matches are skipped (presumably non-company rows).
    for (int i = 2; i < data.size(); i++) {
        Element link = data.get(i).select("a").first();
        Element pitch = data.get(i).select("div.pitch").first();
        if (link == null) {
            continue; // first() returns null when nothing matches
        }
        companies.add(new Company(link.attr("title"), link.attr("href"), pitch == null ? "" : pitch.text()));
    }
    return companies;
}
Main function:
public static void main(String[] args) throws IOException {
    List<Company> companies = new ArrayList<>();
    // Fetch at most 10 pages; pages are numbered from 1.
    for (int page = 1; page <= 10; page++) {
        System.out.println("get page n°" + page);
        CompanyFilter companyFilter = getCompanyFilter("Startup", page);
        System.out.println("digest : " + companyFilter.getDigest());
        System.out.println("count : " + companyFilter.getTotalCount());
        System.out.println("array size : " + companyFilter.getIds().size());
        System.out.println("page : " + companyFilter.getpage());
        List<Company> pageCompanies = getCompanies(companyFilter);
        if (pageCompanies.isEmpty()) {
            break; // stop at the first page that yields no companies
        }
        companies.addAll(pageCompanies);
        System.out.println("size : " + companies.size());
    }
}
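The stopping rule of the loop (end on the first page that yields no companies, or after a fixed number of pages) can be kept separate from the HTTP code. The following is a minimal sketch of that idea only; `Pager`, `collect`, and the fake fetcher are hypothetical stand-ins for the `getCompanyFilter` plus `getCompanies` pair, and no network call is made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper: collects items page by page and stops at the first
// empty page or after maxPages, whichever comes first.
final class Pager {

    static <T> List<T> collect(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // this page returned nothing, so assume there are no more
            }
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in for the two HTTP calls: pages 1 and 2 have data, page 3 is empty.
        IntFunction<List<String>> fake = page ->
                page <= 2 ? List.of("company-" + page + "a", "company-" + page + "b") : List.of();
        System.out.println(collect(fake, 10));
    }
}
```

Checking the result of each page, rather than the accumulated list, is what lets the loop end as soon as the site runs out of results.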
Company, CompanyFilter & HtmlContainer are the model classes:
// Maps the JSON returned by the POST to /company_filters/search_data.
class CompanyFilter {
    @SerializedName("ids")
    private List<Integer> mIds;
    @SerializedName("hexdigest")
    private String mDigest;
    @SerializedName("total")
    private String mTotalCount;
    @SerializedName("page")
    private int mPage;
    @SerializedName("sort")
    private String mSort;
    @SerializedName("new")
    private boolean mNew;

    public List<Integer> getIds() {
        return mIds;
    }

    public String getDigest() {
        return mDigest;
    }

    public String getTotalCount() {
        return mTotalCount;
    }

    public int getpage() {
        return mPage;
    }

    // Builds the query string for the follow-up GET from the fields of this response.
    public String buildRequest() {
        StringBuilder out = new StringBuilder();
        out.append("total=").append(mTotalCount).append("&");
        out.append("sort=").append(mSort).append("&");
        out.append("page=").append(mPage).append("&");
        out.append("new=").append(mNew).append("&");
        for (Integer id : mIds) {
            out.append("ids[]=").append(id).append("&");
        }
        out.append("hexdigest=").append(mDigest).append("&");
        return out.toString();
    }
}
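As an alternative to concatenating the query string by hand, a small builder can percent-encode each parameter and avoid the trailing `&`. This is only a sketch: `QueryBuilder` is a hypothetical class, the sample values are made up, and it assumes the server accepts percent-encoded brackets (`ids%5B%5D`) in place of the raw `ids[]`, which is standard behaviour but was not verified against this site.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical alternative to string concatenation in buildRequest():
// joins encoded parameters with '&' and leaves no trailing separator.
final class QueryBuilder {
    private final StringJoiner query = new StringJoiner("&");

    QueryBuilder add(String name, Object value) {
        query.add(URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(String.valueOf(value), StandardCharsets.UTF_8));
        return this;
    }

    String build() {
        return query.toString();
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        QueryBuilder q = new QueryBuilder()
                .add("total", "42")
                .add("sort", "signal")
                .add("page", 1)
                .add("new", false);
        for (int id : List.of(11, 22)) {
            q.add("ids[]", id); // repeated name, one entry per id
        }
        q.add("hexdigest", "abc123");
        System.out.println(q.build());
    }
}
```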
private static class Company {
    private String mLink;
    private String mName;
    private String mDescription;

    public Company(String name, String link, String description) {
        mLink = link;
        mName = name;
        mDescription = description;
    }

    public String getLink() {
        return mLink;
    }

    public String getName() {
        return mName;
    }

    public String getDescription() {
        return mDescription;
    }
}

private static class HtmlContainer {
    @SerializedName("html")
    private String mHtml;

    public String getHtml() {
        return mHtml;
    }
}
The full code is also available here
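On the second part of the question (selecting a div by several class names): a CSS selector chains each class with a dot after the tag name, and the resulting string can be passed to `select(...)` in the same way as `div[data-_tn]` above. The helper below is a hypothetical sketch that only builds the selector string; it assumes the class names are plain identifiers, and if names such as `dc59` are auto-generated they may not stay the same between page versions.

```java
// Hypothetical helper: turns a tag name and a raw class attribute into a CSS
// selector that requires every listed class.
final class ClassSelector {

    static String of(String tag, String classAttribute) {
        StringBuilder selector = new StringBuilder(tag);
        // Class names in the attribute are separated by whitespace.
        for (String name : classAttribute.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                selector.append('.').append(name);
            }
        }
        return selector.toString();
    }

    public static void main(String[] args) {
        // prints div.dc59.frw44._a._jm
        System.out.println(of("div", " dc59 frw44 _a _jm"));
    }
}
```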