Parsing HTML source code to get an expected value

Date: 2017-09-21 04:01:46

Tags: java java-ee

HTML code to parse:

<table width="100%" border="0" cellpadding="0" cellspacing="0" class="ms-bottompaging" xmlns:x="http://www.w3.org/2001/XMLSchema" xmlns:d="http://schemas.microsoft.com/" xmlns:asp="http://schemas.microsoft.com/ASPNET/20" xmlns:pcm="urn:PageContentManager" xmlns:ddwrt2="urn:frontpage:internal">
     <tbody>
      <tr>
       <td class="ms-bottompagingline1"><img src="/_images/11/images/blank.gif?rev=40" width="1" height="1" alt="" /></td>
      </tr>
      <tr>
       <td class="ms-bottompagingline2"><img src="/_images/11/images/blank.gif?rev=40" width="1" height="1" alt="" /></td>
      </tr>
      <tr>
       <td class="ms-vb" id="bottomPagingCellWPQ2" align="center">
        <table>
         <tbody>
          <tr>
           <td class="ms-paging">1 - 15</td>
           <td><a onclick="javascript:RefreshPageTo(event, &quot;/sites/myAppDetail/My%20Documents/Forms/AllApplicationss.aspx?Paged=TRUE&amp;p_SortBehavior=0&amp;p_FileLeafRef=LT%5fSW%20TEAM%5fNatural%5fItemCode%5f20170909%5fvstatus%2epdf&amp;p_ID=85&amp;RootFolder=%2fmyData%2fFolder3%2fCommon%20Docs%2fdaily%20Report%2f2017&amp;PageFirstRow=16&amp;&amp;View={05465DFA-110E-21FC-8AD6-8B9846567FF8B}&quot;);javascript:return false;" href="javascript:"><img src="/_layouts/15/1011/images/next.gif" border="0" alt="Next" /></a></td>
          </tr>
         </tbody>
        </table></td>
      </tr>
  <tr>.......

How can I get the value of <a onClick=".."> from the HTML code above?

Expected output:

&quot;/sites/myAppDetail/My%20Documents/Forms/AllApplicationss.aspx?Paged=TRUE&amp;p_SortBehavior=0&amp;p_FileLeafRef=LT%5fSW%20TEAM%5fNatural%5fItemCode%5f20170909%5fvstatus%2epdf&amp;p_ID=85&amp;RootFolder=%2fmyData%2fFolder3%2fCommon%20Docs%2fdaily%20Report%2f2017&amp;PageFirstRow=16&amp;&amp;View={05465DFA-110E-21FC-8AD6-8B9846567FF8B}&quot;

I tried the following code, but the output is not what I expected.

File input = new File("myHtml.html");
Document doc = Jsoup.parse(input, "UTF-8");
// Get the value stored inside <a onClick="javascript:RefreshPageTo(event, &quot...)"> near <td class="ms-paging">1 - 15</td>
Elements links = doc.select(".ms-paging > td > a");
System.out.println("size : " + links.size()); // 0
for (Element link : links) {
    System.out.println(link); // empty, it should print the link
}

1 Answer:

Answer 0: (score: 0)

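The selector .ms-paging > td > a looks for a td that is a child of the ms-paging element, but td class="ms-paging" contains only text; the link sits in the td next to it. You need to use the sibling combinator ~ to select the td that follows the td class="ms-paging" element. The following worked for me: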

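File input = new File("myHtml.html"); // same file as in the question
Document doc = Jsoup.parse(input, "UTF-8");
// Select the <a> inside any td that follows td.ms-paging in the same row
Elements elements = doc.select("td.ms-paging ~ td > a");
for (Element e : elements) {
    String attrValue = e.attr("onclick");
    // Keep only the text between the first and the last double quote
    System.out.println(attrValue.substring(attrValue.indexOf("\"") + 1,
                       attrValue.lastIndexOf("\"")));
}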

This prints the expected value (Jsoup decodes the &quot; and &amp; entities when it reads the attribute):

/sites/myAppDetail/My%20Documents/Forms/AllApplicationss.aspx?Paged=TRUE&p_SortBehavior=0&p_FileLeafRef=LT%5fSW%20TEAM%5fNatural%5fItemCode%5f20170909%5fvstatus%2epdf&p_ID=85&RootFolder=%2fmyData%2fFolder3%2fCommon%20Docs%2fdaily%20Report%2f2017&PageFirstRow=16&&View={05465DFA-110E-21FC-8AD6-8B9846567FF8B}
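
One caveat with the substring call above: if an onclick value contains fewer than two double quotes, substring throws a StringIndexOutOfBoundsException. A regular expression is one possible way around that. The sketch below uses only the standard library; the class name OnclickArgument, the method firstQuotedArgument, and the shortened onclick string are made up for illustration and are not part of Jsoup or of the page in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnclickArgument {
    // Matches the first double-quoted run; group 1 is the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    /** Returns the first double-quoted argument of an onclick value, or empty if there is none. */
    public static Optional<String> firstQuotedArgument(String onclick) {
        Matcher m = QUOTED.matcher(onclick);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical, shortened value in the shape that e.attr("onclick") returns
        String onclick = "javascript:RefreshPageTo(event, "
                + "\"/sites/demo/page.aspx?Paged=TRUE&PageFirstRow=16\");javascript:return false;";
        System.out.println(firstQuotedArgument(onclick).orElse("(no quoted argument)"));
    }
}
```

Inside the loop you would pass e.attr("onclick") to firstQuotedArgument and skip the element when the result is empty.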

Hope it helps!