Aggregating data with JSoup

Date: 2013-11-13 22:56:39

Tags: java css-selectors web-crawler jsoup

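I'm trying to use JSoup to scrape some content from http://dictionary.reference.com/browse/quick. If you go to that page, you'll see that they organize the data so that each "word type" (adjective, verb, noun) of the word quick is presented as its own section, and each section contains a list of 1+ definitions.

To make things a bit more complicated, every word inside each definition is a link to another dictionary.com page: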

quick
    adjective
        1. done, proceeding, or occurring with promptness or rapidity...
        2. that is over or completed within a short interval of time
        ...
        14. Archaic.
            a. endowed with life
            b. having a high degree of vigor, energy, ...
    noun
        1. living persons; the quick and the dead
        2. the tender, sensitive flesh of the living body...
        ...
    adverb
        ...

What I'd like to do is use JSoup to get the word types and their respective definitions as lists of strings, like so:

public class Metadata {
    // Ex: "adjective", "noun", etc.
    private String wordType;

    // Ex: String #1: "1. done, proceeding, or occurring with promptness or rapidity..."
    //     String #2: "that is over or completed within a short interval of time..."
    private List<String> definitions;
}

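So the web page effectively contains a List<Metadata>, where each Metadata element is a word type paired with 1+ definitions.

I was able to find the list of word types with a very simple API call: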

// Contains 1 Element for each word type, like "adjective", "noun", etc.
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk span.pg");

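But I'm struggling to figure out which doc.select(...) calls I need to make in order to get each Metadata instance. Does anyone with a good knack for CSS selectors and some proficiency with JSoup have any ideas? Thanks in advance!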

2 answers:

Answer 0 (score: 2)

If you look at the HTML that Jsoup gets from this page, you will see something like this:

  <div class="body"> 
     <div class="pbk"> 
      <span class="pg">adjective </span> 
      <div class="luna-Ent">
       <span class="dnindex">1.</span>
       <div class="dndata">
        done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: 
        <span class="ital-inline">a quick response.</span> 
       </div>
      </div>
      <div class="luna-Ent">
       <span class="dnindex">2.</span>
       <div class="dndata">
        that is over or completed within a short interval of time: 
        <span class="ital-inline">a quick shower.</span> 
       </div>
      </div>
...
     <div class="pbk"> 
      <span class="pg">adverb </span> 
      <div class="luna-Ent">
       <span class="dnindex">19.</span>
       <div class="dndata">
        <a style="font-style:normal; font-weight:normal;" href="/browse/quickly">quickly</a>.
       </div>
      </div> 
     </div> 

So each section

adjective
    1. done, proceeding, or occurring with promptness or rapidity...
    2. that is over or completed within a short interval of time
    ...
    14. Archaic.
        a. endowed with life
        b. having a high degree of vigor, energy, ...
noun
    1. living persons; the quick and the dead
    2. the tender, sensitive flesh of the living body...
    ...
adverb
    ...

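is located inside a <div class="pbk">, which contains a <span class="pg">adjective </span> holding the section name, plus the definitions, each in its own <div class="luna-Ent">. So you can try something like this:
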
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();

// Each div.pbk holds one word type together with its definitions.
Elements sections = doc.select("div.body div.pbk");
for (Element element : sections) {
    // span.pg carries the section name, e.g. "adjective"
    String elementType = element.getElementsByClass("pg").text();
    System.out.println("--------------------");
    System.out.println(elementType);

    // Each div.luna-Ent is one numbered definition.
    for (Element definition : element.getElementsByClass("luna-Ent")) {
        System.out.println(definition.text());
    }
}

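This code selects all the sections, finds each section's name with element.getElementsByClass("pg"), and finds the definitions with element.getElementsByClass("luna-Ent"), relying on the fact that each one sits in a div with class luna-Ent. (If you want to skip the numbers 1., 2., and so on, you can select the dndata class instead of luna-Ent.)

Output: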

--------------------
adjective
1. done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: a quick response.
2. that is over or completed within a short interval of time: a quick shower.
3. moving, or able to move, with speed: a quick fox; a quick train.
4. swift or rapid, as motion: a quick flick of the wrist.
5. easily provoked or excited; hasty: a quick temper.
6. keenly responsive; lively; acute: a quick wit.
7. acting with swiftness or rapidity: a quick worker.
8. prompt or swift to do something: quick to respond.
9. prompt to perceive; sensitive: a quick eye.
10. prompt to understand, learn, etc.; of ready intelligence: a quick student.
11. (of a bend or curve) sharp: a quick bend in the road.
12. consisting of living plants: a quick pot of flowers.
13. brisk, as fire, flames, heat, etc.
14. Archaic. a. endowed with life. b. having a high degree of vigor, energy, or activity.
--------------------
noun
15. living persons: the quick and the dead.
16. the tender, sensitive flesh of the living body, especially that under the nails: nails bitten down to the quick.
17. the vital or most important part.
18. Chiefly British. a. a line of shrubs or plants, especially of hawthorn, forming a hedge. b. a single shrub or plant in such a hedge.
--------------------
adverb
19. quickly.
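
Note that the index in this output is the site's own and runs on across sections (the noun entries start at 15.). If strings have already been collected this way and the index is not wanted later, it can also be stripped after the fact instead of re-selecting dndata. A minimal sketch using only java.util.regex; the names IndexStripper and stripIndex are hypothetical and not part of the original answer:

```java
import java.util.regex.Pattern;

public class IndexStripper {

    // Leading digits, a dot, and any whitespace after it, e.g. "15. "
    private static final Pattern LEADING_INDEX = Pattern.compile("^\\d+\\.\\s*");

    // Returns the definition without its leading index; strings that
    // do not start with an index are returned unchanged.
    public static String stripIndex(String definition) {
        return LEADING_INDEX.matcher(definition).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripIndex("19. quickly."));
        System.out.println(stripIndex("14. Archaic. a. endowed with life."));
    }
}
```

Only the leading number is removed, so sub-items such as "a." and "b." inside a definition are left alone.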

Answer 1 (score: 0)

Here you go. By the way, to test CSS selectors, you can open the console in Chrome Developer Tools and try queries like this directly on their website: jQuery('div.body div.pbk div.luna-Ent > .dndata')

Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk");

for (Element wordType : wordTypes) {
    Elements typeOfSpeech = wordType.select("span.pg");

    System.out.println("typeOfSpeech: " + typeOfSpeech.text());

    Elements elements = wordType.select("div.luna-Ent > .dndata");

    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        System.out.println((i + 1) + ". " + element.text());
    }
}
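
The loop above prints its results. To end up with the List<Metadata> the question asks for, the same selectors can fill a list instead. This is a minimal sketch, assuming jsoup is on the classpath. The class name MetadataCollector, the collect method, the Metadata constructor and getters, and the inline HTML fragment (a stand-in for the live page, shaped like the markup quoted in answer 0) are all hypothetical additions, not part of the original answers:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetadataCollector {

    // Same fields as the Metadata class in the question; the constructor
    // and getters are assumptions added for this sketch.
    public static class Metadata {
        private final String wordType;
        private final List<String> definitions;

        public Metadata(String wordType, List<String> definitions) {
            this.wordType = wordType;
            this.definitions = definitions;
        }

        public String getWordType() { return wordType; }
        public List<String> getDefinitions() { return definitions; }
    }

    // Builds one Metadata per div.pbk section of the parsed document.
    public static List<Metadata> collect(Document doc) {
        List<Metadata> result = new ArrayList<>();
        for (Element section : doc.select("div.body div.pbk")) {
            String wordType = section.select("span.pg").text();
            List<String> definitions = new ArrayList<>();
            // .dndata leaves out the site's numbering; select div.luna-Ent to keep it.
            for (Element definition : section.select("div.luna-Ent > .dndata")) {
                definitions.add(definition.text());
            }
            result.add(new Metadata(wordType, definitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical inline fragment so the sketch runs without network access.
        String html = "<div class=\"body\">"
            + "<div class=\"pbk\"><span class=\"pg\">noun </span>"
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">17.</span>"
            + "<div class=\"dndata\">the vital or most important part.</div></div></div>"
            + "<div class=\"pbk\"><span class=\"pg\">adverb </span>"
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">19.</span>"
            + "<div class=\"dndata\"><a href=\"/browse/quickly\">quickly</a>.</div></div></div>"
            + "</div>";

        for (Metadata m : collect(Jsoup.parse(html))) {
            System.out.println(m.getWordType() + " -> " + m.getDefinitions());
        }
    }
}
```

Against the real page, the Document passed to collect would come from Jsoup.connect(...).get() as in the snippets above, rather than from Jsoup.parse on a string.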