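HTML data extraction using Jsoup

Time: 2013-04-16 17:35:37

Tags: java html web-scraping jsoup

I am trying to use Jsoup to extract mobile specification data from the Nokia Developer site, http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/. How can I get the data for each subcategory, such as "Camera Features", "Graphic Formats", etc., separately?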

import java.io.IOException;
import java.sql.SQLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Nokiareviews {
    public static void main(String[] args) throws IOException, SQLException, InterruptedException {
        Document doc = Jsoup.connect("http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/").timeout(1000*1000).get();
        Elements content = doc.select("div#accordeonContainer");
        for (Element spec : content) {
            System.out.println(spec.text());
        }
    }

}

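1 answer:

Answer 0 (score: 3)

If you look closely, you will see that each category is a <div> with class="accordeonContainer". Its title is in the h2 directly beneath it, and the list of subcategories is in a <dl> with the CSS class "clearfix":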

<div class="accordeonContainer accordeonExpanded">
    <h2 class=" accordeonTitle "><span>Multimedia</span></h2>
    <div class="accordeonContent" id="Multimedia" style="display: block;">
        <dl class="clearfix">
            <dt>Camera Resolution</dt>
            <dd>1600 x 1200 pixels  </dd>
                ...    
            <dt>Graphic Formats</dt>
            <dd>BMP, DCF, EXIF, GIF87a, GIF89a, JPEG, PNG, WBMP </dd>
            ...
        </dl>
    </div>
</div>

You can select the list of elements of a specific type (e.g. elm) with a given CSS class (e.g. clazz) using:

Elements elms = doc.select("elm.clazz");
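As a small illustration of that selector syntax, here is a sketch that runs against an inline copy of the markup above instead of the live page. The class name SelectorDemo and the helper select are hypothetical names made up for this example. It also contrasts the class selector (div.accordeonContainer) with the id selector (div#accordeonContainer) used in the question: in the markup shown, the <div> carries a class but no id, which appears to be why the question's code matched nothing.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Inline copy of part of the markup shown above (assumption: the real page looks like this).
    static final String SAMPLE =
            "<div class=\"accordeonContainer accordeonExpanded\">"
          + "<h2 class=\" accordeonTitle \"><span>Multimedia</span></h2>"
          + "</div>";

    // Hypothetical helper: returns the text of every element matching cssQuery.
    static List<String> select(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<String>();
        for (Element e : doc.select(cssQuery)) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // "tag.class" matches on the class attribute.
        System.out.println(select(SAMPLE, "div.accordeonContainer").size());
        // "tag#id" matches on the id attribute, which this <div> does not have.
        System.out.println(select(SAMPLE, "div#accordeonContainer").size());
        // The title element is matched by its class as well.
        System.out.println(select(SAMPLE, "h2.accordeonTitle"));
    }
}
```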

Then, in short, the code to extract the information you mentioned could look like this:

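import java.io.IOException;
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Nokiareviews {
    public static void main(String[] args) throws IOException {
        // The timeout is given in milliseconds.
        Document doc = Jsoup.connect("http://www.developer.nokia.com/Devices/Device_specifications/Nokia_Asha_308/")
                .timeout(1000 * 1000).get();
        // One container per category; select by class (".") rather than by id ("#").
        Elements content = doc.select("div.accordeonContainer");
        for (Element spec : content) {
            // Category title, e.g. "Multimedia".
            Elements h2 = spec.select("h2.accordeonTitle");
            System.out.println(h2.text());

            // Each <dt> holds a subcategory name and each <dd> its value.
            Elements dl = spec.select("dl.clearfix");
            Elements dts = dl.select("dt");
            Elements dds = dl.select("dd");

            // Walk both lists in parallel, pairing the n-th <dt> with the n-th <dd>.
            Iterator<Element> dtsIterator = dts.iterator();
            Iterator<Element> ddsIterator = dds.iterator();
            while (dtsIterator.hasNext() && ddsIterator.hasNext()) {
                Element dt = dtsIterator.next();
                Element dd = ddsIterator.next();
                System.out.println("\t\t" + dt.text() + "\t\t" + dd.text());
            }
        }
    }
}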
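If you want the values keyed by name rather than printed, one possible variation is sketched below. It is not part of the approach above but an alternative pairing strategy: each <dt> is matched with the element that immediately follows it, so a <dt> without a <dd> does not shift every later pair, which the two-iterator version would not detect. The class name SpecTable, the helper parse, and the inline SAMPLE markup are assumptions made for the example; the sample mirrors the snippet shown earlier with the elided entries left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SpecTable {
    // Inline sample based on the markup shown earlier (assumption: the live page matches it).
    static final String SAMPLE =
            "<div class=\"accordeonContainer accordeonExpanded\">"
          + "<h2 class=\" accordeonTitle \"><span>Multimedia</span></h2>"
          + "<div class=\"accordeonContent\" id=\"Multimedia\">"
          + "<dl class=\"clearfix\">"
          + "<dt>Camera Resolution</dt><dd>1600 x 1200 pixels  </dd>"
          + "<dt>Graphic Formats</dt><dd>BMP, DCF, EXIF, GIF87a, GIF89a, JPEG, PNG, WBMP </dd>"
          + "</dl></div></div>";

    // Hypothetical helper: category title -> (subcategory name -> value), in page order.
    static Map<String, Map<String, String>> parse(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Map<String, String>> result = new LinkedHashMap<String, Map<String, String>>();
        for (Element container : doc.select("div.accordeonContainer")) {
            String title = container.select("h2.accordeonTitle").text();
            Map<String, String> specs = new LinkedHashMap<String, String>();
            for (Element dt : container.select("dl.clearfix > dt")) {
                // Pair each <dt> with the element directly after it, if that is a <dd>.
                Element next = dt.nextElementSibling();
                if (next != null && "dd".equals(next.tagName())) {
                    specs.put(dt.text(), next.text());
                }
            }
            result.put(title, specs);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Map<String, String>> category : parse(SAMPLE).entrySet()) {
            System.out.println(category.getKey());
            for (Map.Entry<String, String> spec : category.getValue().entrySet()) {
                System.out.println("\t" + spec.getKey() + ": " + spec.getValue());
            }
        }
    }
}
```

To run this against the real page, pass the result of Jsoup.connect(...).get().html() to parse, or change parse to accept a Document directly.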

If you use Maven, be sure to add this to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>