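I am trying to use Jsoup to extract mobile specification data from the Nokia Developer site, http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/. How can I get the data for each subcategory, such as "Camera Features", "Graphic Formats", etc., separately?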
import java.io.IOException;
import java.sql.SQLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Nokiareviews {
    public static void main(String[] args) throws IOException, SQLException, InterruptedException {
        Document doc = Jsoup.connect("http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/").timeout(1000*1000).get();
        Elements content = doc.select("div#accordeonContainer");
        for (Element spec : content) {
            System.out.println(spec.text());
        }
    }
}
Answer 0 (score: 3)
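If you look closely, you will see that each category is a <div> with class=accordeonContainer, with its title in an h2 beneath it and the list of subcategories in a <dl> with the CSS class "clearfix":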
<div class="accordeonContainer accordeonExpanded">
    <h2 class=" accordeonTitle "><span>Multimedia</span></h2>
    <div class="accordeonContent" id="Multimedia" style="display: block;">
        <dl class="clearfix">
            <dt>Camera Resolution</dt>
            <dd>1600 x 1200 pixels </dd>
            ...
            <dt>Graphic Formats</dt>
            <dd>BMP, DCF, EXIF, GIF87a, GIF89a, JPEG, PNG, WBMP </dd>
            ...
        </dl>
    </div>
</div>
You can select the list of elements of a given type (e.g. elm) that have a given CSS class (e.g. clazz) with:
Elements elms = doc.select("elm.clazz");
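To make the selector syntax concrete, here is a minimal sketch that runs against an inline HTML fragment modeled on the markup above instead of the live page. The class name SelectorDemo, the helper count, and the fragment itself are illustrative assumptions. It also suggests why the selector in the question, div#accordeonContainer, matches nothing: # selects by id, while . selects by class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical fragment modeled on the markup shown above.
    static final String HTML =
            "<div class=\"accordeonContainer accordeonExpanded\">"
            + "<h2 class=\"accordeonTitle\"><span>Multimedia</span></h2>"
            + "</div>";

    // Returns how many elements in the fragment match the given CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // "#" matches an id; no element here has id="accordeonContainer".
        System.out.println(count("div#accordeonContainer")); // prints 0
        // "." matches a class; the <div> carries class accordeonContainer.
        System.out.println(count("div.accordeonContainer")); // prints 1
    }
}
```

Running the fragment through Jsoup.parse keeps the check independent of the network, which is handy when the page layout is what you are trying to work out.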
In short, the code to extract the information you mentioned could then look like this:
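import java.io.IOException;
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Nokiareviews {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/")
                .timeout(1000 * 1000).get(); // timeout is given in milliseconds
        // Each category is a <div> with the class accordeonContainer.
        Elements content = doc.select("div.accordeonContainer");
        for (Element spec : content) {
            // The category title is in the <h2>.
            Elements h2 = spec.select("h2.accordeonTitle");
            System.out.println(h2.text());
            // Subcategory names are <dt> elements; their values are <dd> elements.
            Elements dl = spec.select("dl.clearfix");
            Elements dts = dl.select("dt");
            Elements dds = dl.select("dd");
            Iterator<Element> dtsIterator = dts.iterator();
            Iterator<Element> ddsIterator = dds.iterator();
            // Walk both lists in step to pair each name with its value.
            while (dtsIterator.hasNext() && ddsIterator.hasNext()) {
                Element dt = dtsIterator.next();
                Element dd = ddsIterator.next();
                System.out.println("\t\t" + dt.text() + "\t\t" + dd.text());
            }
        }
    }
}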
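If you want each subcategory available separately instead of printed, one option is to collect the pairs into a nested map keyed by category title. The following is only a sketch under stated assumptions: the class name SpecCollector, the method collect, and the inline SAMPLE_HTML (modeled on the markup shown earlier, not fetched from the site) are all hypothetical. It pairs each dt with the dd that immediately follows it, which avoids misaligned pairs if some dt has no value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SpecCollector {
    // Hypothetical sample modeled on the markup shown earlier.
    static final String SAMPLE_HTML =
            "<div class=\"accordeonContainer accordeonExpanded\">"
            + "<h2 class=\"accordeonTitle\"><span>Multimedia</span></h2>"
            + "<div class=\"accordeonContent\" id=\"Multimedia\">"
            + "<dl class=\"clearfix\">"
            + "<dt>Camera Resolution</dt><dd>1600 x 1200 pixels</dd>"
            + "<dt>Graphic Formats</dt><dd>BMP, DCF, EXIF, GIF87a, GIF89a, JPEG, PNG, WBMP</dd>"
            + "</dl></div></div>";

    // Maps category title -> (subcategory name -> value), in document order.
    static Map<String, Map<String, String>> collect(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Map<String, String>> specs = new LinkedHashMap<String, Map<String, String>>();
        for (Element category : doc.select("div.accordeonContainer")) {
            String title = category.select("h2.accordeonTitle").text();
            Map<String, String> entries = new LinkedHashMap<String, String>();
            for (Element dt : category.select("dl.clearfix > dt")) {
                Element next = dt.nextElementSibling();
                // Only pair a <dt> with a <dd> that directly follows it.
                if (next != null && "dd".equals(next.tagName())) {
                    entries.put(dt.text(), next.text());
                }
            }
            specs.put(title, entries);
        }
        return specs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> specs = collect(SAMPLE_HTML);
        // Look up one subcategory on its own.
        System.out.println(specs.get("Multimedia").get("Camera Resolution"));
    }
}
```

To use it against the real page, you would pass the HTML of the fetched document (for example doc.html()) to collect, assuming the page still uses the same markup.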
If you use Maven, be sure to add this to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>