XPath query for parsing an HTML page

Time: 2014-05-15 01:55:59

Tags: html parsing xpath

I have an HTML file that looks like this:
<HTML>
        <BODY>
            <TABLE width="100%" border="0" cellpadding="0" cellspacing="0">
                <tr>
                    <td height="400" align="right" valign="top" class="text_rail_left"></td>
                    <td width="100%" align="left" valign="top" class="text_back_color"><table border="0" cellPadding="0" cellSpacing="0" width="100%"><tr>

                    </tr><tr>
                        <td width="100%" align="left" align="top"><table width="100%" border="0" cellspacing="2" cellpadding="0">
                            <tr>
                                <td align="center" valign="top" class="inside_heading_text">Train Names with Details</td>
                            </tr> <tr>
                                <td><b><BR><BR> SORRY !!! No Matching buses Found</b></td></tr>
                            <tr><td>
                            </td></tr></table>
                        <td align="left" valign="top" class="pad_self"><table width="100%" border="0" cellspacing="2" cellpadding="2">
                            <tr><td align="right" valign="top">&nbsp;</td>
                            </tr></table></td>
                        </tr></table></td>
                    <td align="left" valign="top" class="text_rail_right">&nbsp;</td>
                </tr>
                <tr>
                    <td width="10" align="left" valign="top"><img src="http://www.indianrail.gov.in/main_text_left_bottom2.gif" alt="" width="8"/></td>
                    <td width="100%" align="left" valign="top" class="text_rail_bottom"><img src="http://www.indianrail.gov.in/blank.gif" alt="" width="1" height="8" /></td>
                    <td width="10" align="right" valign="top"> <img src="http://www.indianrail.gov.in/main_text_right_bottom2.gif" alt="" width="8" /></td>
                </tr></table><body>
                    <FONT size=1>No. of Queries :  0839425885
                        , &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server : YAMUNA
                        , &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Dated : 15-05-2014 Time:07:15:26 Hrs</font></td></tr></table></td></tr> </table></td></tr></table></td></tr></table></td></tr><tr><td align="left"valign="top"><table width="970" border="0" cellspacing="0" cellpadding="0"><tr> <td width="9" align="left" valign="top"><img src="http://www.indianrail.gov.in/images/footer_upper_lft.gif" alt="" width="9" height="49" /></td><td width="100%%" align="left" valign="top" class="footer_upper"><table width="100%%" border="0" cellspacing="1" cellpadding="0"><tr><td align="center" valign="top" class="main_footer_upper"><a href="../index.html"  onclick="resetButton()">Home </a> | <a href="http://www.indianrailways.gov.in/railwayboard/" target="_blank">Ministry of Railways</a> |      <a href="../know_Station_Code.html" onclick="resetButton()">Trains between Stations</a> | <a href="../booking_Location.html" onclick="resetButton()">Booking Locations</a> | <a href="http://www.cris.org.in/" target="_blank">CRIS</a> | <a href="../about_Concert.html"  onclick="resetButton()">CONCERT</a> | <a href="../advertisement.html"  onclick="resetButton()">Advertise with CRIS</a> | <a href="http://www.indianrail.gov.in/images/rail-map.jpg" target="_blank">Railway Map</a> | <a href="../faq.html"  onclick="resetButton()">FAQ</a> | <a href="../sitemap.html"  onclick="resetButton()">Sitemap</a> | <a href="http://www.trainenquiry.com/Feedback.aspx" target="_blank" onclick="resetButton()">Feedback</a></td></tr><tr><td align="center"valign="top" class="copy_footer" style="padding-top:3px;"><span class="main_footer_copy"><a href="../copyright.html"  onclick="resetButton()">Copyright</a></span> &copy; 2010, Centre For Railway Information Systems, Designed and Hosted by CRIS | <span class="main_footer_copy"><a href="../disclaimer.html" onclick="resetButton()">Disclaimer</a></span><br />Best viewed at 1024 x 768 resolution with Internet Explorer 5.0 or Mozila Firefox 3.5 and 
higher</td></tr> </table></td><td width="9" align="right" valign="top"><img src="http://www.indianrail.gov.in/images/footer_upper_rgt.gif" alt="" width="9" height="49" /></td></tr></table></td></tr></table></td></tr></table><script type="text/javascript">anylinkmenu.init("menuanchorclass")</script>
    </BODY>
</HTML>

I want to write an XPath query that reads the string

SORRY !!! No Matching buses Found

There is no unique class that identifies the element containing the string. I tried the XPath query

@"//td[@class='inside_heading_text']/tr"

but it doesn't seem to work.

Can someone point me in the right direction? I am using the Ono library in Objective-C to parse the HTML.

2 answers:

Answer 0: (score: 1)

Well, this will get you the container of the "SORRY" text:

//*[contains(text(),'SORRY')]

I recommend the FireFinder extension for Firebug (on Firefox) as an easy way to experiment with XPath.
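
As a minimal sketch of what this expression matches, the snippet below emulates it in Python. The standard-library `xml.etree.ElementTree` module implements only a subset of XPath and has no `contains()`, so the hypothetical helper `elements_with_text` walks the tree and inspects each element's own text nodes instead. One caveat worth knowing: in XPath 1.0, `contains(text(), 'SORRY')` tests only the first text node of each element, whereas `//*[text()[contains(., 'SORRY')]]` tests every text node, which is what the helper approximates. The sample markup is a trimmed, well-formed fragment of the table from the question, not the full page.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed fragment of the question's table (illustrative only).
SAMPLE = """
<table>
  <tr>
    <td class="inside_heading_text">Train Names with Details</td>
  </tr>
  <tr>
    <td><b><br/><br/> SORRY !!! No Matching buses Found</b></td>
  </tr>
</table>
"""


def own_text_nodes(element):
    """Yield the text nodes that are direct children of `element`."""
    if element.text:
        yield element.text
    for child in element:
        # In ElementTree, text that follows a child is stored as the child's tail.
        if child.tail:
            yield child.tail


def elements_with_text(root, needle):
    """Hypothetical helper: roughly //*[text()[contains(., needle)]]."""
    return [
        el for el in root.iter()
        if any(needle in text for text in own_text_nodes(el))
    ]


root = ET.fromstring(SAMPLE)
matches = elements_with_text(root, "SORRY")
for el in matches:
    # Print the tag of each matching container and its full text content.
    print(el.tag, "->", "".join(el.itertext()).strip())
```

On this fragment the only match should be the `<b>` element, since the text belongs to it rather than to the enclosing `<td>` or to the `<br/>` elements.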

Answer 1: (score: 1)

That's some ugly HTML you've got there.

There are unclosed elements, duplicated td/@align attributes, and so on. If you want to use XPath, you will have to clean it up first.

If you can at least clean it up, manually or automatically, into something like this:

<?xml version="1.0" encoding="utf-8"?>
<HTML>
  <BODY>
    <TABLE width="100%" border="0" cellpadding="0" cellspacing="0">
      <tr>
        <td height="400" align="right" valign="top" class="text_rail_left">
        </td>
        <td width="100%" align="left" valign="top" class="text_back_color">
          <table border="0" cellPadding="0" cellSpacing="0" width="100%">
            <tr>
            </tr>
            <tr>
              <td width="100%" align="left">
                <table width="100%" border="0" cellspacing="2" cellpadding="0">
                  <tr>
                    <td align="center" valign="top" class="inside_heading_text">Train Names with Details</td>
                  </tr>
                  <tr>
                    <td>
                      <b>
                        <BR/>
                      <BR/> SORRY !!! No Matching buses Found</b>
                    </td>
                  </tr>
                  <tr>
                    <td>
                    </td>
                  </tr>
                </table>
              </td>
            </tr>
          </table>
        </td>
        <td align="left" valign="top" class="text_rail_right"></td>
      </tr>

    </TABLE>
  </BODY>
</HTML>

Then this XPath will select the element containing the "SORRY ..." text, using the inside_heading_text cell you mentioned as the reference point:

//td[@class='inside_heading_text']/../following-sibling::tr[1]/td[1]/b
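
To try this expression outside Objective-C, here is a rough Python sketch run against a shortened version of the cleaned-up markup. `xml.etree.ElementTree` does not implement the `following-sibling` axis, so the sketch performs that one step by hand; the variable names and the trimmed sample are assumptions made for illustration. A full XPath 1.0 engine should be able to evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Shortened version of the cleaned-up document (illustrative only).
CLEANED = """
<HTML>
  <BODY>
    <table width="100%" border="0" cellspacing="2" cellpadding="0">
      <tr>
        <td align="center" valign="top" class="inside_heading_text">Train Names with Details</td>
      </tr>
      <tr>
        <td>
          <b>
            <BR/>
            <BR/> SORRY !!! No Matching buses Found</b>
        </td>
      </tr>
      <tr>
        <td>
        </td>
      </tr>
    </table>
  </BODY>
</HTML>
"""

root = ET.fromstring(CLEANED)

# //td[@class='inside_heading_text']/..  -> the <tr> holding the heading cell
heading_row = root.find(".//td[@class='inside_heading_text']/..")
# One more step up gives the <table> that owns all the rows.
table = root.find(".//td[@class='inside_heading_text']/../..")

# following-sibling::tr[1], done manually: the first <tr> after the heading row.
rows = list(table)
later_rows = [r for r in rows[rows.index(heading_row) + 1:] if r.tag == "tr"]
next_row = later_rows[0]

# td[1]/b  -> the <b> inside the first cell of that row
bold = next_row.find("td[1]/b")
message = "".join(bold.itertext()).strip()
print(message)
```

Note that the expression selects the `<b>` element rather than a string, so the surrounding whitespace and the `<BR/>` children have to be dealt with when extracting the text, as the `itertext()` and `strip()` calls do here.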