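I'm trying to use JSoup to scrape some content from http://dictionary.reference.com/browse/quick. If you go to that page, you'll see that they organize the data so that each "word type" (adjective, verb, noun) of the word quick is presented as its own section, and each section contains a list of one or more definitions.
To complicate things further, every word in every definition is a link to another dictionary.com page: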
quick
adjective
1. done, proceeding, or occurring with promptness or rapidity...
2. that is over or completed within a short interval of time
...
14. Archaic.
a. endowed with life
b. having a high degree of vigor, energy, ...
noun
1. living persons; the quick and the dead
2. the tender, sensitive flesh of the living body...
...
adverb
...
What I want to do is use JSoup to obtain the word types and their respective definitions as lists of strings, like so:
public class Metadata {
    // Ex: "adjective", "noun", etc.
    private String wordType;
    // Ex: String #1: "1. done, proceeding, or occurring with promptness or rapidity..."
    //     String #2: "that is over or completed within a short interval of time..."
    private List<String> definitions;
}
So the web page effectively contains a List<Metadata>, where each Metadata element is a word type paired with one or more definitions.
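One way to flesh out the Metadata class so it can actually be populated is sketched below. The constructor, the getters, toString, and the main method are assumptions added for illustration; the question itself only declares the two fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: everything except the two fields is an assumption,
// not something the question defines.
public class Metadata {
    // Ex: "adjective", "noun", etc.
    private final String wordType;
    // One entry per definition listed under this word type.
    private final List<String> definitions;

    public Metadata(String wordType, List<String> definitions) {
        this.wordType = wordType;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.definitions = new ArrayList<>(definitions);
    }

    public String getWordType() {
        return wordType;
    }

    public List<String> getDefinitions() {
        return Collections.unmodifiableList(definitions);
    }

    @Override
    public String toString() {
        return wordType + " " + definitions;
    }

    public static void main(String[] args) {
        // The whole page would then be represented as a List<Metadata>.
        List<Metadata> page = new ArrayList<>();
        page.add(new Metadata("adverb", Arrays.asList("19. quickly.")));
        System.out.println(page);
    }
}
```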
I was able to find the list of word types with a very simple API call:
// Contains 1 Element for each word type, like "adjective", "noun", etc.
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk span.pg");
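But I'm struggling to figure out what I need to pass to doc.select(...) to build each Metadata instance. Does anyone with a good knack for CSS selectors and solid JSoup experience have any ideas? Thanks in advance!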
Answer 0 (score: 2)
If you look at the HTML that Jsoup retrieves from this page, you will see something like this:
<div class="body">
<div class="pbk">
<span class="pg">adjective </span>
<div class="luna-Ent">
<span class="dnindex">1.</span>
<div class="dndata">
done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate:
<span class="ital-inline">a quick response.</span>
</div>
</div>
<div class="luna-Ent">
<span class="dnindex">2.</span>
<div class="dndata">
that is over or completed within a short interval of time:
<span class="ital-inline">a quick shower.</span>
</div>
</div>
...
<div class="pbk">
<span class="pg">adverb </span>
<div class="luna-Ent">
<span class="dnindex">19.</span>
<div class="dndata">
<a style="font-style:normal; font-weight:normal;" href="/browse/quickly">quickly</a>.
</div>
</div>
</div>
So each section
adjective
1. done, proceeding, or occurring with promptness or rapidity...
2. that is over or completed within a short interval of time
...
14. Archaic.
a. endowed with life
b. having a high degree of vigor, energy, ...
noun
1. living persons; the quick and the dead
2. the tender, sensitive flesh of the living body...
...
adverb
...
is located inside a <div class="pbk">, which contains a <span class="pg">adjective </span> holding the section name, plus the definitions inside <div class="luna-Ent"> elements. So you can try something like:
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
// One div.pbk per word type (adjective, noun, adverb, ...).
Elements sections = doc.select("div.body div.pbk");
for (Element element : sections) {
    String elementType = element.getElementsByClass("pg").text();
    System.out.println("--------------------");
    System.out.println(elementType);
    // Each luna-Ent div holds one numbered definition.
    for (Element definition : element.getElementsByClass("luna-Ent")) {
        System.out.println(definition.text());
    }
}
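This code selects all sections, finds each section's name with element.getElementsByClass("pg"), and finds the definitions with element.getElementsByClass("luna-Ent"), relying on the fact that they sit in divs with the class luna-Ent. (If you want to skip the numbers 1., 2., and so on, you can select the dndata class instead of luna-Ent.)
Output: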
--------------------
adjective
1. done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: a quick response.
2. that is over or completed within a short interval of time: a quick shower.
3. moving, or able to move, with speed: a quick fox; a quick train.
4. swift or rapid, as motion: a quick flick of the wrist.
5. easily provoked or excited; hasty: a quick temper.
6. keenly responsive; lively; acute: a quick wit.
7. acting with swiftness or rapidity: a quick worker.
8. prompt or swift to do something: quick to respond.
9. prompt to perceive; sensitive: a quick eye.
10. prompt to understand, learn, etc.; of ready intelligence: a quick student.
11. (of a bend or curve) sharp: a quick bend in the road.
12. consisting of living plants: a quick pot of flowers.
13. brisk, as fire, flames, heat, etc.
14. Archaic. a. endowed with life. b. having a high degree of vigor, energy, or activity.
--------------------
noun
15. living persons: the quick and the dead.
16. the tender, sensitive flesh of the living body, especially that under the nails: nails bitten down to the quick.
17. the vital or most important part.
18. Chiefly British. a. a line of shrubs or plants, especially of hawthorn, forming a hedge. b. a single shrub or plant in such a hedge.
--------------------
adverb
19. quickly.
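To tie this back to the List<Metadata> the question asks for, the same selectors can collect the results instead of printing them. The following is a minimal sketch and not part of the original answer: the class name MetadataScraper, the method extract, and the nested Metadata holder with its constructor are hypothetical. It parses an inline HTML fragment modelled on the excerpt above with Jsoup.parse rather than fetching the live page, whose markup may have changed.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetadataScraper {

    // Hypothetical holder mirroring the question's Metadata class.
    static class Metadata {
        final String wordType;
        final List<String> definitions;

        Metadata(String wordType, List<String> definitions) {
            this.wordType = wordType;
            this.definitions = definitions;
        }
    }

    // Builds one Metadata per div.pbk section: the span.pg text is the word
    // type, and each div.luna-Ent contributes one definition string.
    static List<Metadata> extract(Document doc) {
        List<Metadata> result = new ArrayList<>();
        for (Element section : doc.select("div.body div.pbk")) {
            String wordType = section.select("span.pg").text().trim();
            List<String> definitions = new ArrayList<>();
            for (Element entry : section.select("div.luna-Ent")) {
                definitions.add(entry.text());
            }
            result.add(new Metadata(wordType, definitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Inline fragment shaped like the excerpt above, so no network is needed.
        String html =
            "<div class=\"body\">"
            + "<div class=\"pbk\"><span class=\"pg\">adjective </span> "
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">1.</span> "
            + "<div class=\"dndata\">done quickly</div></div></div>"
            + "<div class=\"pbk\"><span class=\"pg\">adverb </span> "
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">19.</span> "
            + "<div class=\"dndata\">quickly.</div></div></div>"
            + "</div>";

        for (Metadata m : extract(Jsoup.parse(html))) {
            System.out.println(m.wordType + " -> " + m.definitions);
        }
    }
}
```

Swapping "div.luna-Ent" for "div.luna-Ent > .dndata" in extract would drop the leading numbers from each definition, as noted above.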
Answer 1 (score: 0)
Here you go. By the way, to test CSS selectors, you can open the console in Chrome Developer Tools and try a query like this directly on their site: jQuery('div.body div.pbk div.luna-Ent > .dndata')
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk");
for (Element wordType : wordTypes) {
    Elements typeOfSpeech = wordType.select("span.pg");
    System.out.println("typeOfSpeech: " + typeOfSpeech.text());
    // .dndata excludes the site's own index span, so the loop renumbers from 1 in each section.
    Elements elements = wordType.select("div.luna-Ent > .dndata");
    for (int i = 0; i < elements.size(); i++) {
        Element element = elements.get(i);
        System.out.println((i + 1) + ". " + element.text());
    }
}