JSoup: How do I get information that follows a certain tag?

Time: 2018-07-31 13:19:26

Tags: java jsoup html-parsing

I have an HTML page like this:

<a name="robots"></a>
<div class="dnb">
   <div class="info_outer">
      <div class="info">
         <div class="name"><a href="/p/1/">TEXT1</a> <span class="t">TEXT2</span></div>
         <div class="role">SOMEROLE1</div>
      </div>
   </div>
</div>
<a name="humans"></a>
<div class="dnb">
   <div class="info_outer">
      <div class="info">
         <div class="name"><a href="/p/1/">TEXT3</a> <span class="t">TEXT4</span></div>
         <div class="role">SOMEROLE2</div>
      </div>
   </div>
</div>
<div class="dnb">
   <div class="info_outer">
      <div class="info">
         <div class="name"><a href="/p/1/">TEXT5</a> <span class="t">TEXT6</span></div>
         <div class="role">SOMEROLE3</div>
      </div>
   </div>
</div>
<div class="dnb">
   <div class="info_outer">
      <div class="info">
         <div class="name"><a href="/p/1/">TEXT7</a> <span class="t">TEXT8</span></div>
         <div class="role">SOMEROLE4</div>
      </div>
   </div>
</div>

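I need to get the information (name and role) from these divs, but only from the ones that belong to the "humans" separator. Is that possible with JSoup?

1 answer:

Answer 0: (score: 1)

Yes, it is possible. Take a look at the jsoup selector syntax. The example below finds the a[name=humans] anchor, steps to the element that follows it, and reads the name and role from that element. Note that it only reads the first div.dnb after the anchor.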

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoupSelectors {
    public static void main(String[] args) throws IOException {
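        // The page from the question, saved locally; a null charset lets jsoup detect the encoding.
        File input = new File("WeAreTheRobots.xml");
        Document doc = Jsoup.parse(input, null);
        // Find every <a name="humans"> separator.
        for (Element human : doc.select("a[name=humans]")) {
            // The block that directly follows the separator is its next element sibling.
            Element info = human.nextElementSibling().selectFirst("div.dnb>div.info_outer>div.info");
            // A selector with a leading ">" matches direct children of the context element.
            String name = info.selectFirst(">div.name>span.t").ownText();
            System.out.println("Name = " + name);
            String role = info.selectFirst(">div.role").ownText();
            System.out.println("Role = " + role);
        }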
    }
}

Output:

Name = TEXT4
Role = SOMEROLE2
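
If every div.dnb after the "humans" separator should be read, and not only the first one, one possible approach is to walk the following siblings until the next named anchor appears. The sketch below is an illustration, not part of the original answer: the class name HumansSection and the helper collectSection are hypothetical, and it assumes the separators and the div.dnb blocks are siblings, as in the question's HTML. It also uses text() on div.name, so the name contains both the link text and the span text.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HumansSection {
    // Hypothetical helper: returns "name | role" for every div.dnb that follows
    // the anchor with the given name, stopping at the next <a name=...> anchor.
    static List<String> collectSection(Document doc, String anchorName) {
        List<String> result = new ArrayList<>();
        Element anchor = doc.selectFirst("a[name=" + anchorName + "]");
        if (anchor == null) {
            return result; // no such separator on the page
        }
        for (Element sib = anchor.nextElementSibling(); sib != null; sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("a") && sib.hasAttr("name")) {
                break; // the next separator starts a different section
            }
            if (!sib.hasClass("dnb")) {
                continue; // skip unrelated elements between blocks
            }
            Element name = sib.selectFirst("div.info > div.name");
            Element role = sib.selectFirst("div.info > div.role");
            if (name != null && role != null) {
                result.add(name.text() + " | " + role.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened sample input modelled on the question's page (an assumption).
        String html =
              "<a name=\"robots\"></a>"
            + "<div class=\"dnb\"><div class=\"info_outer\"><div class=\"info\">"
            + "<div class=\"name\"><a href=\"/p/1/\">TEXT1</a> <span class=\"t\">TEXT2</span></div>"
            + "<div class=\"role\">SOMEROLE1</div></div></div></div>"
            + "<a name=\"humans\"></a>"
            + "<div class=\"dnb\"><div class=\"info_outer\"><div class=\"info\">"
            + "<div class=\"name\"><a href=\"/p/1/\">TEXT3</a> <span class=\"t\">TEXT4</span></div>"
            + "<div class=\"role\">SOMEROLE2</div></div></div></div>"
            + "<div class=\"dnb\"><div class=\"info_outer\"><div class=\"info\">"
            + "<div class=\"name\"><a href=\"/p/1/\">TEXT5</a> <span class=\"t\">TEXT6</span></div>"
            + "<div class=\"role\">SOMEROLE3</div></div></div></div>";
        Document doc = Jsoup.parse(html);
        for (String line : collectSection(doc, "humans")) {
            System.out.println(line);
        }
    }
}
```

With the sample input above, only the two blocks after the "humans" separator are reported, and the "robots" block is skipped. If a page wraps each section in its own container instead of using sibling anchors, a descendant selector on that container would likely be simpler than walking siblings.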