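I need to parse a page with jsoup using element tags such as div, h3, and a. I want to parse the div.g elements and show the text of the links with a class="l _PMs" and a class="_pJs" in a jList.
Taking Google News as an example, the page looks like this: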
<div class="g">
<div class="ts _JGs _KHs _oGs _KGs _jHs">
<a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
<img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&&google.aft&&google.aft(this)">
</a>
<div class="_hJs">
<h3 class="r _gJs">
<a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Report on <em>Example</em> Testing<em>Club</em> ...</a>
</h3>
<div class="slp">
<span class="_OHs _PHs">link</span>
<span class="_QGs">-</span>
<span class="f nsa _QHs">date</span>
</div>
<div class="st">description</div>
</div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of <em>example's</em> of <em>testing</em>...
</a>
</div>
<div class="_cJs"></div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this testing
</a>
</div>
<div class="_cJs"></div>
<div class="_eJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Test report example
</a>
</div>
<div class="_cJs"></div>
</div>
</div>
<div class="g">
<div class="ts _JGs _KHs _oGs _KGs _jHs">
<a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
<img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&&google.aft&&google.aft(this)">
</a>
<div class="_hJs">
<h3 class="r _gJs">
<a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Cloud<em>Example</em> Testing<em>1</em> ...</a>
</h3>
<div class="slp">
<span class="_OHs _PHs">link</span>
<span class="_QGs">-</span>
<span class="f nsa _QHs">date</span>
</div>
<div class="st">description</div>
</div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of this<em>testing</em>...
</a>
</div>
<div class="_cJs"></div>
<div class="_sJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this...
</a>
</div>
<div class="_cJs"></div>
<div class="_eJs card-section">
<a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Example 2...
</a>
</div>
<div class="_cJs"></div>
<div class="tsw _QMs">
<div class="_jJs card-section">
<a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfs','','dfd','','',event)" data-href="url">
<img class="_iJs" id="news-media-image-52779751835836-0" src="url" alt="image1" onload="typeof google==='object'&&google.aft&&google.aft(this)">
<div class="_RMs">USA TODAY.</div>
</a>
<a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfsa','','dsfa','','',event)">
<img class="_iJs" id="news-media-image-52779751835836-1" src="url" alt="image2" onload="typeof google==='object'&&google.aft&&google.aft(this)">
<div class="_RMs">image2.</div>
</a>
</div>
<div class="_NMs">
<a class="_OMs" href="url">View all
</a>
</div>
</div>
</div>
</div>
Here is the code:
String input = txtSearch.getText();
input = input.replace(" ", "+");
String url = "http://www.google.com/search?q=" + input + "&tbm=nws&source=lnms";
try {
Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
Elements e = doc.select("div.g");
DefaultListModel<String> listModel = new DefaultListModel<>();
e.forEach((e1) -> {
e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});
newsList.setModel(listModel);
} catch (IOException ex) {
Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}
The actual output shown in the jList is:
Report on Example Testing Club...
Final review of example's of testing...
Report on this testing.
Test report example.
Cloud Example Testing 1.
Final review of this testing.
Report on this...
Example 2...
USA TODAY.
image2.
View all
How can I select only the a class="l _PMs" and a class="_pJs" elements, without selecting a class=_MHs and a class=_OMs, so that the jList shows the following:
Report on Example Testing Club...
Final review of example's of testing...
Report on this testing.
Test report example.
Cloud Example Testing 1.
Final review of this testing.
Report on this...
Example 2...
Answer 0 (score: 0)
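Just change this line:
Elements e = doc.select("div.g");
to
Elements e = doc.select("div.g").select("a");
Then, in the loop, read only the text of each element, like this:
for (Element element : e)
{
yourList.add(element.text());
}
With Elements e = doc.select("div.g").select("a"); we get a list of every a element inside the div.g elements. We can then iterate over each a element in a for loop and read its text or even its attributes.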
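The loop described above can be sketched against a small inline document instead of a live Google request. This is only an illustration: the class name LinkTextDemo, the helper describeLinks, and the trimmed HTML string are hypothetical stand-ins, not part of the original code. It reads both the text and the href attribute of each link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkTextDemo {
    // Trimmed, hypothetical stand-in for one div.g block from the question.
    static final String HTML =
        "<div class=\"g\">"
        + "<h3 class=\"r _gJs\"><a class=\"l _PMs\" href=\"url1\">Report on <em>Example</em></a></h3>"
        + "<a class=\"_pJs\" href=\"url2\">Test report example</a>"
        + "<a class=\"_OMs\" href=\"url3\">View all</a>"
        + "</div>";

    // Collects "text -> href" for every a element found inside a div.g element.
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("div.g").select("a")) {
            // text() drops nested markup such as <em>; attr() reads an attribute value.
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        describeLinks(HTML).forEach(System.out::println);
    }
}
```

Note that this sketch picks up every link inside div.g, including the "View all" link, so it does not by itself filter out the a._MHs and a._OMs elements.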
Answer 1 (score: 0)
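The problem is that you were selecting all a elements inside the given div and calling the .text() method on that list, which naturally returns the concatenated text of all the a elements.
To make the code work as expected, you can change:
e.forEach((e1) -> {
listModel.addElement(e1.getElementsByTag("a").text());
});
to:
e.forEach((e1) -> {
e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});
Update
If you only want to select a elements with the l + _PMs or _pJs classes, you can rewrite the code like this:
Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
DefaultListModel<String> listModel = new DefaultListModel<>();
doc.select("div.g a.l._PMs, div.g a._pJs")
.forEach(element -> listModel.addElement(element.text()));
newsList.setModel(listModel);
The selector is div.g a.l._PMs, div.g a._pJs, which selects every element that satisfies one of the following conditions:
an a element with the l and _PMs classes, located inside a div element with the g class;
an a element with the _pJs class, located inside a div element with the g class.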
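To see the grouped selector in isolation, here is a minimal sketch that runs it on an inline HTML string rather than a live search page. The class name SelectorDemo, the helper headlineTexts, and the shortened markup are assumptions made for illustration; only the selector string comes from the answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Shortened, hypothetical version of one div.g block from the question.
    static final String HTML =
        "<div class=\"g\">"
        + "<h3 class=\"r _gJs\"><a class=\"l _PMs\" href=\"url\">Cloud <em>Example</em> Testing</a></h3>"
        + "<div class=\"_sJs card-section\"><a class=\"_pJs\" href=\"url\">Report on this...</a></div>"
        + "<div class=\"tsw _QMs\">"
        + "<a class=\"_MHs\" href=\"url\">USA TODAY.</a>"
        + "<a class=\"_OMs\" href=\"url\">View all</a>"
        + "</div>"
        + "</div>";

    // Returns the text of links matching either branch of the comma-separated selector.
    static List<String> headlineTexts(String html) {
        List<String> texts = new ArrayList<>();
        for (Element link : Jsoup.parse(html).select("div.g a.l._PMs, div.g a._pJs")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        headlineTexts(HTML).forEach(System.out::println);
    }
}
```

The a._MHs and a._OMs links are skipped because neither branch of the selector matches their classes. In a Swing application, each returned string would be added to the DefaultListModel as in the code above.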