Parsing a complex li tag

Time: 2017-09-13 11:54:25

Tags: java html parsing jsoup html-parsing

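I am trying to parse an HTML file using Jsoup. Some of the text in the HTML is not wrapped in any tag of its own; it sits directly inside the li element.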

<li class="inactive"> 
  <span class="status label">inactive</span> 
  <a href="/officers/144662696" class="officer inactive" title="more info on MILLTOWN CORPORATE SERVICES">
     MILLTOWN CORPORATE SERVICES
  </a>
  member, 
  <span class="status label">inactive</span> 
  <a href="/companies/us_wv/193180" class="company inactive revoked_(failure_to_file_annual_report)" title="More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)">
    EASTBRIDGE L.L.C.
   </a> 
   (West Virginia (US), 
   <span class="start_date">25 May 2000</span>-<span class="end_date"> 1 Aug 2002</span>)  
</li>

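I am able to read everything inside the child tags, but I am trying to get the values "member," and "(West Virginia (US),", which are not inside any of them.

Is there a way to get the values that are outside the classed elements but still inside the li tag?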

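2 answers:

Answer 0: (score: 0)

You are probably looking for something like Element#ownText.

It returns only the text that belongs directly to the current element, not the combined text of all its child elements.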

Element listItem = doc.select("li.inactive").first();
System.out.println(listItem.ownText()); // prints "member, (West Virginia (US), -)"
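If the individual pieces are needed separately rather than as one joined string, a minimal sketch along the same lines could iterate over Element#textNodes instead. The class name OwnTextPieces and the helper ownTextPieces are hypothetical names chosen for illustration, and the HTML in main is a shortened version of the markup from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class OwnTextPieces {

    // Hypothetical helper: returns the non-blank direct text nodes of the
    // first element matching cssQuery, each with whitespace normalised and trimmed.
    static List<String> ownTextPieces(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element element = doc.select(cssQuery).first();
        List<String> pieces = new ArrayList<>();
        if (element == null) {
            return pieces; // nothing matched the selector
        }
        // textNodes() yields only direct child text nodes, not text inside child elements
        for (TextNode node : element.textNodes()) {
            if (!node.isBlank()) {
                pieces.add(node.text().trim());
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Shortened version of the question's markup (assumption: same structure)
        String html = "<li class=\"inactive\">"
                + "<a href=\"/officers/144662696\">MILLTOWN CORPORATE SERVICES</a> member, "
                + "<a href=\"/companies/us_wv/193180\">EASTBRIDGE L.L.C.</a> (West Virginia (US), "
                + "<span class=\"start_date\">25 May 2000</span>-"
                + "<span class=\"end_date\"> 1 Aug 2002</span>)</li>";
        for (String piece : ownTextPieces(html, "li.inactive")) {
            System.out.println(piece);
        }
    }
}
```

Skipping blank nodes matters here because the whitespace between adjacent tags is itself a text node; without the isBlank() check the list would contain empty entries.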

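Answer 1: (score: 0)

You can also start from the preceding tag to reach a text node that is not wrapped in any tag. If I understood correctly, you want the text node that follows each a tag. Try something like: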

    String html = "<li class=\"inactive\"> \n"
            + "  <span class=\"status label\">inactive</span> \n"
            + "  <a href=\"/officers/144662696\" class=\"officer inactive\" title=\"more info on MILLTOWN CORPORATE SERVICES\">\n"
            + "     MILLTOWN CORPORATE SERVICES\n"
            + "  </a>\n"
            + "  member, \n"
            + "  <span class=\"status label\">inactive</span> \n"
            + "  <a href=\"/companies/us_wv/193180\" class=\"company inactive revoked_(failure_to_file_annual_report)\" title=\"More Free And Open Company Data On EASTBRIDGE L.L.C. (West Virginia (US), 193180)\">\n"
            + "    EASTBRIDGE L.L.C.\n"
            + "   </a> \n"
            + "   (West Virginia (US), \n"
            + "   <span class=\"start_date\">25 May 2000</span>-<span class=\"end_date\"> 1 Aug 2002</span>)  \n"
            + "</li>";

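    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a");
    for (Element e : links) {
        // nextSibling() is the node right after </a>; it can be null or an element,
        // so check for a TextNode (org.jsoup.nodes.TextNode) before printing
        Node next = e.nextSibling();
        if (next instanceof TextNode) {
            System.out.println(((TextNode) next).text().trim());
        }
    }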
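
To keep each value associated with the link it follows, the same sibling lookup could fill a map keyed by the link text. This is only a sketch under assumptions: the class name TrailingText and the method textAfterLinks are hypothetical, and links that are not followed by a text node are mapped to an empty string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.LinkedHashMap;
import java.util.Map;

public class TrailingText {

    // Hypothetical helper: maps the text of every <a> element to the text node
    // that immediately follows it (empty string when there is none).
    static Map<String, String> textAfterLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>(); // keeps document order
        for (Element link : doc.select("a")) {
            Node next = link.nextSibling();
            String trailing = (next instanceof TextNode)
                    ? ((TextNode) next).text().trim()
                    : "";
            result.put(link.text(), trailing);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the question's markup (assumption: same structure)
        String html = "<li class=\"inactive\">"
                + "<a href=\"/officers/144662696\">MILLTOWN CORPORATE SERVICES</a> member, "
                + "<span class=\"status label\">inactive</span> "
                + "<a href=\"/companies/us_wv/193180\">EASTBRIDGE L.L.C.</a> (West Virginia (US), "
                + "<span class=\"start_date\">25 May 2000</span></li>";
        textAfterLinks(html).forEach((link, text) -> System.out.println(link + " -> " + text));
    }
}
```

Note that a map keyed by link text overwrites earlier entries when two links share the same text; a list of pairs would be the safer choice for pages where that can happen.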