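Here is my HTML. I want to extract two details from it:
Publisher: Springer-Verlag, Price: $7,284
The problem is that all of the outer and inner class names are identical. Please suggest how to get these two values from the HTML below using jsoup.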
<div class="details">
<div class="fullname">ANALYTICAL AND BIOANALYTICAL CHEMISTRY (2011)</div>
<div class="catbox">
<div class="catcontents">
<div class="contents_ct1">Eigenfactor Category:</div>
<div class="contents_ct2" style="margin-left: -5px;">ANALYTIC CHEMISTRY</div>
</div>
<div class="catcontents">
<div class="contents_ct1">ISI Category:</div>
<div class="contents_ct2" style="margin-left: -49px;">CO EA</div>
</div>
<div class="catcontents">
<div class="contents_ct1">Group:</div>
<div class="contents_ct2" style="margin-left: -80px;">Sci</div>
</div>
<div class="catcontents">
<div class="contents_ct1">Total Articles (5yrs):</div>
<div class="contents_ct2" style="margin-left: -12px;">3,544</div>
</div>
</div>
<div class="catbox" style="margin-left: 20px">
<div class="catcontents">
<div class="contents_ct1">Publisher:</div>
<div class="contents_ct2" style="margin-left: -55px;">Springer-Verlag</div>
</div>
<div class="catcontents">
<div class="contents_ct1">First Published:</div>
<div class="contents_ct2" style="margin-left: -35px;">2001</div>
</div>
<div class="catcontents">
<div class="contents_ct1"><a href="http://journalprices.com/" title="Prices provided by JournalPrices.com" target="_blank" style="font-size: 11px">Price:</a></div>
<div class="contents_ct2" style="margin-left: -80px;">$7,284</div>
</div>
<div class="catcontents">
<div class="contents_ct1">Cost Effectiveness:</div>
<div class="contents_ct2" style="margin-left: -18px;">1.0302</div>
</div>
</div>
<div class="tgraph">
<div class="plotB">
<iframe src="plot1.php?issn=1618-2642" width="370px" height="220px" frameborder=0 scrolling="no"></iframe>
</div>
<div class="plotB" style="margin-left: 10px">
<iframe src="plot2.php?issn=1618-2642" width="340px" height="220px" frameborder=0 scrolling="no"></iframe>
</div>
</div>
</div>
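Answer 0 (score: 1)
Static HTML structure
Assuming the layout always follows the structure of the source you provided, you can use plain CSS selector syntax to specify which elements to pick out.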
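// Assumes doc was created with: Document doc = Jsoup.parse(html);
// :eq(n) matches on the 0-based sibling index, so div.catbox:eq(2) is the second catbox (div.fullname is index 0).
Element publisher = doc.select("div.catbox:eq(2) div.catcontents div.contents_ct2").first();
Element price = doc.select("div.catbox:eq(2) div.catcontents:eq(2) div.contents_ct2").first();
System.out.println("Publisher: " + publisher.text() + "\nPrice: " + price.text());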
This prints:
run:
Publisher: Springer-Verlag
Price: $7,284
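The `:eq(2)` in the selectors above can be surprising, because it counts sibling positions (where `div.fullname` takes index 0) rather than matches. As a minimal sketch, assuming the same fixed layout, the same lookup can be written by position among the matched elements instead. The class name `PositionalLookup`, the helpers `row` and `sampleHtml`, and the trimmed sample markup are illustrative assumptions, not part of the original page or answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PositionalLookup {

    // Hypothetical helper: builds one label/value row like those in the question.
    static String row(String label, String value) {
        return "<div class=\"catcontents\">"
             + "<div class=\"contents_ct1\">" + label + "</div>"
             + "<div class=\"contents_ct2\">" + value + "</div>"
             + "</div>";
    }

    // Trimmed stand-in for the question's markup (assumption: same nesting).
    static String sampleHtml() {
        return "<div class=\"details\">"
             + "<div class=\"fullname\">ANALYTICAL AND BIOANALYTICAL CHEMISTRY (2011)</div>"
             + "<div class=\"catbox\">" + row("Group:", "Sci") + "</div>"
             + "<div class=\"catbox\">"
             + row("Publisher:", "Springer-Verlag")
             + row("First Published:", "2001")
             + row("Price:", "$7,284")
             + "</div>"
             + "</div>";
    }

    // Returns {publisher, price}, chosen purely by position.
    static String[] publisherAndPrice(String html) {
        Document doc = Jsoup.parse(html);
        // get(1) is the second *matched* catbox, regardless of other siblings.
        Element secondBox = doc.select("div.catbox").get(1);
        Elements values = secondBox.select("div.contents_ct2");
        // Rows in the second box: 0 = Publisher, 1 = First Published, 2 = Price.
        return new String[] { values.get(0).text(), values.get(2).text() };
    }

    public static void main(String[] args) {
        String[] result = publisherAndPrice(sampleHtml());
        System.out.println("Publisher: " + result[0] + "\nPrice: " + result[1]);
    }
}
```

Like the selectors above, this breaks as soon as a row or box is added, removed, or reordered, so it is only suitable when the page layout is known to be stable.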
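Dynamic HTML structure
If the structure is not always the same, the code below should produce the same result, but it identifies the elements by checking their text rather than relying on their position.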
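// Each div.catcontents row holds a label (contents_ct1) and a value (contents_ct2).
Elements content = doc.select("div.catcontents");
Element publisher = null;
Element price = null;
for (Element element : content) {
    // text() of a row is the label plus the value, e.g. "Publisher: Springer-Verlag".
    String text = element.text();
    if (text.startsWith("Publisher")) {
        publisher = element;
    } else if (text.startsWith("Price")) {
        price = element;
    }
}
// Either element stays null if no matching row was found, so check before calling text().
if (publisher != null && price != null) {
    System.out.println(publisher.text() + "\n" + price.text());
}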
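The loop above keeps the whole row, so its text includes the label as well as the value. If only the values are needed, one possible variation is to collect every row into a label-to-value map and look up the labels afterwards. This is a sketch under the assumption that every row contains one `contents_ct1` label and one `contents_ct2` value, as in the question's markup. The class name `JournalDetails`, the method `extractDetails`, and the shortened sample HTML are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JournalDetails {

    // Builds a label -> value map from every div.catcontents row.
    static Map<String, String> extractDetails(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> details = new LinkedHashMap<>();
        for (Element row : doc.select("div.catcontents")) {
            Element label = row.select("div.contents_ct1").first();
            Element value = row.select("div.contents_ct2").first();
            if (label == null || value == null) {
                continue; // skip rows that do not follow the assumed layout
            }
            // text() also covers nested tags, e.g. the <a> around "Price:".
            String key = label.text().trim();
            if (key.endsWith(":")) {
                key = key.substring(0, key.length() - 1);
            }
            details.put(key, value.text().trim());
        }
        return details;
    }

    // Shortened stand-in for the question's second catbox (assumption).
    static String sampleHtml() {
        return "<div class=\"catbox\">"
             + "<div class=\"catcontents\">"
             + "<div class=\"contents_ct1\">Publisher:</div>"
             + "<div class=\"contents_ct2\">Springer-Verlag</div>"
             + "</div>"
             + "<div class=\"catcontents\">"
             + "<div class=\"contents_ct1\">First Published:</div>"
             + "<div class=\"contents_ct2\">2001</div>"
             + "</div>"
             + "<div class=\"catcontents\">"
             + "<div class=\"contents_ct1\"><a href=\"http://journalprices.com/\">Price:</a></div>"
             + "<div class=\"contents_ct2\">$7,284</div>"
             + "</div>"
             + "</div>";
    }

    public static void main(String[] args) {
        Map<String, String> details = extractDetails(sampleHtml());
        System.out.println("Publisher: " + details.get("Publisher"));
        System.out.println("Price: " + details.get("Price"));
    }
}
```

Because the lookup is keyed on the label text, it keeps working if rows are reordered or moved to another `catbox`, but it will return `null` for a label whose wording changes.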