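I'm trying to work out the exact XPath query expressions so that I can use RapidMiner to scrape data from a website. I need one query to isolate each of the following lines separately:
Wed 7/11/2012
TROLL
9999999999999
07.11.12
CONNOTE FILE LODGED
Tue 20/11/2012 1:12 PM
So far I only have //td[@class='select']/text()
Note: the values will change, so the queries need to be position-specific.
What are the six separate queries, one for each value?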
<tr>
<td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
Wed 7/11/2012<br>
TROLL
</td>
<td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
9999999999999
<br>07.11.12
</td>
<td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
CONNOTE FILE LODGED <br>
Tue 20/11/2012 1:12 PM
</td>
</tr>
</table>
Answer 0 (score: 0)
Tested with the Ruby library Nokogiri (which sits on top of libxml2 and implements XPath 1.0):
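require 'nokogiri'

# One expression per value: td[n] picks the cell, text()[m] picks the
# text node before (1) or after (2) the <br> inside that cell.
XPATHS = %w{
//tr/td[1]/text()[1]
//tr/td[1]/text()[2]
//tr/td[2]/text()[1]
//tr/td[2]/text()[2]
//tr/td[3]/text()[1]
//tr/td[3]/text()[2]
}

# `html` is assumed to hold the page source as a String.
d = Nokogiri::HTML(html)
XPATHS.each{ |expression| p d.at_xpath(expression).content }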
#=> "\n Wed 7/11/2012"
#=> "\n TROLL\u00A0\n\n "
#=> "\n 9999999999999\n "
#=> "07.11.12\n\n \u00A0\n "
#=> "\n\n\n\n\n CONNOTE FILE LODGED "
#=> "\n Tue 20/11/2012 1:12 PM\n \u00A0\n\n\n\n\u00A0\n "
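As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you will probably want to remove. We can use normalize-space to strip it: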
XPATHS = %w{
normalize-space(//tr/td[1]/text()[1])
normalize-space(//tr/td[1]/text()[2])
normalize-space(//tr/td[2]/text()[1])
normalize-space(//tr/td[2]/text()[2])
normalize-space(//tr/td[3]/text()[1])
normalize-space(//tr/td[3]/text()[2])
}
XPATHS.each{ |expression| p d.xpath(expression) }
#=> "Wed 7/11/2012"
#=> "TROLL\u00A0"
#=> "9999999999999"
#=> "07.11.12 \u00A0"
#=> "CONNOTE FILE LODGED"
#=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
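The `\u00A0` characters left in the output above are non-breaking spaces. XPath 1.0's normalize-space only treats space, tab, carriage return and line feed as whitespace, so they survive. If the results are post-processed in Ruby, a minimal sketch along these lines could strip them. The helper name `clean_text` is made up for illustration and is not part of Nokogiri:

```ruby
# Hypothetical helper (not part of Nokogiri): collapses every run of Unicode
# whitespace, including U+00A0, into a single space and trims both ends.
# Ruby's [[:space:]] bracket is Unicode-aware, whereas \s only matches
# ASCII whitespace and would leave the non-breaking spaces in place.
def clean_text(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# Sample inputs copied from the normalize-space output above.
samples = [
  "TROLL\u00A0",
  "07.11.12 \u00A0",
  "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
]

samples.each { |s| p clean_text(s) }
#=> "TROLL"
#=> "07.11.12"
#=> "Tue 20/11/2012 1:12 PM"
```

If the cleanup has to stay inside the XPath expression itself, as it probably does in RapidMiner, one option is to wrap the node in translate() before normalizing, for example normalize-space(translate(//tr/td[1]/text()[2], X, ' ')), where X stands for a string literal containing a single U+00A0 character. How to type that character depends on the tool, so treat this as an untested suggestion.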