Exact XPath location in Java

Time: 2013-05-02 06:57:24

Tags: xpath rapidminer

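I am trying to work out exact XPath query expressions so that I can use RapidMiner to process data from a website. I need a query that isolates each of the following lines separately: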

  

Wed 7/11/2012

TROLL

9999999999999

07.11.12

CONNOTE FILE LODGED

Tue 20/11/2012 1:12 PM

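So far I only have //td[@class='select']/text()

Note: the values will change, so the queries need to be position-specific.

What are the six separate queries, one for each value?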

        <tr>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            Wed 7/11/2012<br>
            TROLL&nbsp;

          </td>
          <td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            9999999999999
            <br>07.11.12

            &nbsp;
          </td>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">




                CONNOTE FILE LODGED <br>
                Tue 20/11/2012 1:12 PM
              &nbsp;



&nbsp;
          </td>

        </tr>

    </table>

1 answer:

Answer 0 (score: 0)

Tested with the Ruby library Nokogiri (which sits on top of libxml2 and implements XPath 1.0):

XPATHS = %w{
  //tr/td[1]/text()[1]
  //tr/td[1]/text()[2]
  //tr/td[2]/text()[1]
  //tr/td[2]/text()[2]
  //tr/td[3]/text()[1]
  //tr/td[3]/text()[2]
}

require 'nokogiri'
# `html` is assumed to be a String holding the table markup from the question
d = Nokogiri.HTML(html)

XPATHS.each{ |expression| p d.at_xpath(expression).content }
#=> "\n            Wed 7/11/2012"
#=> "\n            TROLL\u00A0\n\n          "
#=> "\n            9999999999999\n            "
#=> "07.11.12\n\n            \u00A0\n          "
#=> "\n\n\n\n\n                CONNOTE FILE LODGED "
#=> "\n                Tue 20/11/2012 1:12 PM\n              \u00A0\n\n\n\n\u00A0\n          "
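
Since the question title mentions Java, the same positional queries can also be tried with the JDK's built-in javax.xml.xpath API. The following is only a minimal sketch and not part of the original answer. It assumes a simplified, well-formed copy of the row (`<br>` written as `<br/>`, `&nbsp;` written as `&#160;`, event handlers dropped), because the JDK XML parser rejects raw HTML. The class name `PositionalXPath`, the `ROW` constant, and the `evaluate` helper are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class PositionalXPath {
    // Assumption: a simplified, well-formed stand-in for the row in the question.
    static final String ROW =
          "<table><tr>"
        + "<td class=\"select\">\n  Wed 7/11/2012<br/>\n  TROLL&#160;\n</td>"
        + "<td class=\"select\" align=\"center\">\n  9999999999999\n  <br/>07.11.12\n  &#160;\n</td>"
        + "<td class=\"select\">\n  CONNOTE FILE LODGED <br/>\n  Tue 20/11/2012 1:12 PM\n  &#160;\n</td>"
        + "</tr></table>";

    // Evaluates an XPath 1.0 expression against ROW and returns its string value.
    // For a node-set result, that is the text of the first matching node.
    static String evaluate(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(ROW)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Three cells, two text nodes per cell (one before and one after the <br/>).
        for (int cell = 1; cell <= 3; cell++) {
            for (int part = 1; part <= 2; part++) {
                String expression = "//tr/td[" + cell + "]/text()[" + part + "]";
                // Brackets make the surrounding whitespace visible.
                System.out.println(expression + " -> [" + evaluate(expression) + "]");
            }
        }
    }
}
```

Printing the raw results in brackets makes the surrounding whitespace of each text node visible, just as in the Nokogiri output.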

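As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you will probably want to strip. We can use normalize-space to remove it: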

XPATHS = %w{
  normalize-space(//tr/td[1]/text()[1])
  normalize-space(//tr/td[1]/text()[2])
  normalize-space(//tr/td[2]/text()[1])
  normalize-space(//tr/td[2]/text()[2])
  normalize-space(//tr/td[3]/text()[1])
  normalize-space(//tr/td[3]/text()[2])
}

XPATHS.each{ |expression| p d.xpath(expression) }
#=> "Wed 7/11/2012"
#=> "TROLL\u00A0"
#=> "9999999999999"
#=> "07.11.12 \u00A0"
#=> "CONNOTE FILE LODGED"
#=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
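
The results above still end in \u00A0 because XPath 1.0's normalize-space only treats space, tab, carriage return, and line feed as whitespace, so the non-breaking space produced by `&nbsp;` survives. One possible way around this, offered here as a sketch rather than as part of the original answer, is to map U+00A0 to an ordinary space with the standard XPath function translate() before normalizing. The example uses the JDK's javax.xml.xpath API and assumes a simplified, well-formed copy of the row. `NbspCleanXPath`, `ROW`, `cleaned`, and `evaluate` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class NbspCleanXPath {
    // Assumption: a simplified, well-formed stand-in for the row in the question.
    static final String ROW =
          "<table><tr>"
        + "<td class=\"select\">\n  Wed 7/11/2012<br/>\n  TROLL&#160;\n</td>"
        + "<td class=\"select\" align=\"center\">\n  9999999999999\n  <br/>07.11.12\n  &#160;\n</td>"
        + "<td class=\"select\">\n  CONNOTE FILE LODGED <br/>\n  Tue 20/11/2012 1:12 PM\n  &#160;\n</td>"
        + "</tr></table>";

    // Wraps a node-selecting path so that every non-breaking space (U+00A0)
    // becomes a plain space before normalize-space trims and collapses it.
    static String cleaned(String path) {
        return "normalize-space(translate(" + path + ", '\u00A0', ' '))";
    }

    // Evaluates an XPath 1.0 expression against ROW and returns its string value.
    static String evaluate(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(ROW)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        for (int cell = 1; cell <= 3; cell++) {
            for (int part = 1; part <= 2; part++) {
                String path = "//tr/td[" + cell + "]/text()[" + part + "]";
                System.out.println("[" + evaluate(cleaned(path)) + "]");
            }
        }
    }
}
```

Because translate() and normalize-space() are both XPath 1.0 core functions, the wrapped expression should also be usable in other XPath 1.0 engines, provided the non-breaking space character can be written literally in the query string.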