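How do I select child elements with the same tag inside the same element in jsoup?

Asked: 2018-01-16 09:41:14

Tags: java html jsoup

I need to parse a page with jsoup by element tags such as div, h3, a and so on. I want to parse the div.g elements and get the text of the links with these classes: a class="l _PMs" and a class="_pJs". The text will be displayed in a JList.

Taking Google News as an example, the page looks like this: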

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Report on <em>Example</em> Testing<em>Club</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of <em>example's</em> of <em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this testing
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Test report example
            </a>
        </div>
        <div class="_cJs"></div>
    </div>
</div>

<div class="g">
    <div class="ts _JGs _KHs _oGs _KGs _jHs">
        <a class="top _xGs _SHs" href="url" onmousedown="return rwt(this,'','','','1','dfda','','sdfa','','',event)">
            <img class="th _RGs" src="url" alt="Story image" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
        </a>
        <div class="_hJs">
            <h3 class="r _gJs">
                <a class="l _PMs" href="url" onmousedown="return rwt(this,'','','','1','dfs','','sdfa','','',event)">Cloud<em>Example</em> Testing<em>1</em> ...</a>
            </h3>
            <div class="slp">
                <span class="_OHs _PHs">link</span>
                <span class="_QGs">-</span>
                <span class="f nsa _QHs">date</span>
            </div>
            <div class="st">description</div>
        </div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','sdf','','sdfa','','',event)" data-href="url">Final review of this<em>testing</em>...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_sJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','dfa','','dfs-d','','',event)" data-href="url">Report on this...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="_eJs card-section">
            <a class="_pJs" href="url" onmousedown="return rwt(this,'','','','1','ad','','dfsaf','','',event)">Example 2...
            </a>
        </div>
        <div class="_cJs"></div>
        <div class="tsw _QMs">
            <div class="_jJs card-section">
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfs','','dfd','','',event)" data-href="url">
                    <img class="_iJs" id="news-media-image-52779751835836-0" src="url" alt="image1" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">USA TODAY.</div>
                </a>
                <a class="_MHs" href="url" target="_blank" onmousedown="return rwt(this,'','','','2','sdfsa','','dsfa','','',event)">
                    <img class="_iJs" id="news-media-image-52779751835836-1" src="url" alt="image2" onload="typeof google==='object'&amp;&amp;google.aft&amp;&amp;google.aft(this)">
                    <div class="_RMs">image2.</div>
                </a>
            </div>
            <div class="_NMs">
                <a class="_OMs" href="url">View all
                </a>
            </div>
        </div>
    </div>
</div>

Here is the code:

String input = txtSearch.getText();
input = input.replace(" ", "+");
String url = "http://www.google.com/search?q=" + input + "&tbm=nws&source=lnms";
try {
    Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
    Elements e = doc.select("div.g");
    DefaultListModel<String> listModel = new DefaultListModel<>();
    e.forEach((e1) -> {
        e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
    });
    newsList.setModel(listModel);            
} catch (IOException ex) {
    Logger.getLogger(MainUI.class.getName()).log(Level.SEVERE, null, ex);
}

The actual output shown in the JList is:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...   
USA TODAY.   
image2.   
View all

How can I select only the links with these classes: a class="l _PMs" and a class="_pJs", without selecting a class=_MHs and a class=_OMs, so that the JList shows the following:

Report on Example Testing Club...  
Final review of example's of testing...  
Report on this testing.  
Test report example.
Cloud Example Testing 1.   
Final review of this testing.   
Report on this...   
Example 2...

2 Answers:

Answer 0: (score: 0)

Just change this line:

Elements e = doc.select("div.g");

to:

Elements e = doc.select("div.g").select("a");

Then, inside the loop, just read the text of each element, for example:

    for (Element element : e) {
        // Use the loop variable here; e.text() would return the combined text of every match.
        yourList.add(element.text());
    }

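With Elements e = doc.select("div.g").select("a"); we get a list of every a element inside each div.g. We can then iterate over each a tag with a for loop and read its text or even its attributes.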
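As a rough illustration of the chained select described above, here is a minimal sketch that runs against a small inline HTML string instead of a live page. The class name ChainedSelectSketch, the helper describeLinks and the sample markup are all made up for this example; only the jsoup calls (Jsoup.parse, select, text, attr) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class ChainedSelectSketch {

    // Hypothetical, trimmed-down stand-in for one result block from the question.
    static final String SAMPLE_HTML =
            "<div class=\"g\">"
          + "<h3 class=\"r\"><a class=\"l _PMs\" href=\"/story\">Main story</a></h3>"
          + "<div class=\"card-section\"><a class=\"_pJs\" href=\"/related\">Related story</a></div>"
          + "<div class=\"_NMs\"><a class=\"_OMs\" href=\"/all\">View all</a></div>"
          + "</div>";

    // Collects "text -> href" for every a element found inside a div.g.
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.g").select("a");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            // text() is the visible text, attr("href") the raw attribute value
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        describeLinks(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Note that this selection is not filtered by class, so the "View all" link in the sample is still part of the result. Narrowing it down to particular classes needs a more specific selector.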

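Answer 1: (score: 0)

The problem is that you are selecting all the a elements inside a given div and calling the .text() method on that list - it naturally returns the concatenated text of all the a elements.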

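To make the code work as expected, you can change:

e.forEach((e1) -> {
    listModel.addElement(e1.getElementsByTag("a").text());
});

to:

e.forEach((e1) -> {
    e1.getElementsByTag("a").forEach(linkElement -> listModel.addElement(linkElement.text()));
});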
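The difference between calling text() on the whole selection and calling it on each link can be seen in a small standalone sketch. The class ElementsTextSketch, its two helper methods and the sample markup are hypothetical and only serve the illustration. A plain List<String> stands in for the list model so the example runs without a UI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ElementsTextSketch {

    // Hypothetical markup: one div.g holding two links.
    static final String SAMPLE_HTML =
            "<div class=\"g\">"
          + "<a href=\"#\">First headline</a>"
          + "<a href=\"#\">Second headline</a>"
          + "</div>";

    // Elements.text() joins the text of every matched link into one string per div.g.
    static List<String> combinedPerBlock(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element block : doc.select("div.g")) {
            rows.add(block.getElementsByTag("a").text());
        }
        return rows;
    }

    // Element.text() on each link yields one row per link.
    static List<String> onePerLink(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element block : doc.select("div.g")) {
            for (Element link : block.getElementsByTag("a")) {
                rows.add(link.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(combinedPerBlock(SAMPLE_HTML)); // one combined row
        System.out.println(onePerLink(SAMPLE_HTML));       // two separate rows
    }
}
```

With the sample markup, the first method produces a single row containing both headlines, while the second produces one row per headline, which is what a list of links normally needs.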

Update

If you only want to select the a elements with the l + _PMs classes or the _pJs class, you can rewrite the code like this:

Document doc = Jsoup.connect(url).userAgent("Chrome").timeout(5000).get();
DefaultListModel<String> listModel = new DefaultListModel<>();
doc.select("div.g a.l._PMs, div.g a._pJs")
        .forEach(element -> listModel.addElement(element.text()));
newsList.setModel(listModel);

The selector is div.g a.l._PMs, div.g a._pJs, which selects every element that meets one of the following conditions:

  • It is an a element with the l and _PMs classes inside a div element with the g class
  • It is an a element with the _pJs class inside a div element with the g class
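To check the combined selector without hitting the network, it can be tried on a small inline document. This is only a sketch: the class GroupedSelectorSketch, the helper selectHeadlines and the sample markup are invented for the example, and the results are returned as a List<String> rather than added to a list model.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class GroupedSelectorSketch {

    // Hypothetical markup modelled loosely on the structure shown in the question.
    static final String SAMPLE_HTML =
            "<div class=\"g\">"
          + "<h3 class=\"r _gJs\"><a class=\"l _PMs\" href=\"url\">Main headline</a></h3>"
          + "<div class=\"_sJs card-section\"><a class=\"_pJs\" href=\"url\">Related headline</a></div>"
          + "<div class=\"tsw _QMs\">"
          + "<a class=\"_MHs\" href=\"url\">Publisher</a>"
          + "<a class=\"_OMs\" href=\"url\">View all</a>"
          + "</div>"
          + "</div>"
          + "<a class=\"_pJs\" href=\"url\">Link outside any div.g</a>";

    // Returns the text of links matching either half of the comma-separated selector.
    static List<String> selectHeadlines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> headlines = new ArrayList<>();
        for (Element link : doc.select("div.g a.l._PMs, div.g a._pJs")) {
            headlines.add(link.text());
        }
        return headlines;
    }

    public static void main(String[] args) {
        selectHeadlines(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

In this sample the _MHs and _OMs links are skipped because neither half of the selector matches them, and the last link is skipped because it is not inside a div.g. Each returned string could then be passed to listModel.addElement as in the code above.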