How to get the index of a 'th' element using Nokogiri

Time: 2015-05-14 04:30:22

Tags: ruby nokogiri

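I have the following HTML and need to determine the index of the "Number of Strings" column using its <span> ID. I'm using Nokogiri to parse the HTML and get the row.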

require 'nokogiri'

doc = Nokogiri::HTML(File.read('myfile.html'))
table = doc.xpath("//span[@id='NumStrs']/../../..")
row = table.xpath("tr[1]")

Here is the HTML:

<tr>
<th id ="langframe">
<span id="cabinet">
Cabinet</span>
</th>
<th id ="langbb1">
<span id="bb1">
BB1</span>
</th>
<th id ="langbb2">
<span id="bb2">
BB2</span>
</th>
<th id ="langtemp">
<span id="Temp">
Temperature</span>
</th>
<th id="langstrs">
<span id="StringsPresent">
Strings Present</span>
</th>
<th id="langmstrQty">
<span id="NumStrs">
Number of Strings</span>
</th>
</tr>

2 Answers:

Answer 0 (score: 2)

I did it using Ruby's with_index combined with select:

require 'nokogiri'  # => true

doc = Nokogiri::HTML(<<EOT)
<tr>
<th id ="langframe">
<span id="cabinet">
Cabinet</span>
</th>
<th id ="langbb1">
<span id="bb1">
BB1</span>
</th>
<th id ="langbb2">
<span id="bb2">
BB2</span>
</th>
<th id ="langtemp">
<span id="Temp">
Temperature</span>
</th>
<th id="langstrs">
<span id="StringsPresent">
Strings Present</span>
</th>
<th id="langmstrQty">
<span id="NumStrs">
Number of Strings</span>
</th>
</tr>
EOT

th_idx = doc.search('th').to_enum.with_index.select { |th, idx| th.text['Number of Strings'] }.first

This returns:

th_idx 
# => [#(Element:0x3fe72d83cd3c {
#       name = "th",
#       attributes = [
#         #(Attr:0x3fe72d4440f4 { name = "id", value = "langmstrQty" })],
#       children = [
#         #(Text "\n"),
#         #(Element:0x3fe72d43c3e0 {
#           name = "span",
#           attributes = [
#             #(Attr:0x3fe72d439b04 { name = "id", value = "NumStrs" })],
#           children = [ #(Text "\nNumber of Strings")]
#           }),
#         #(Text "\n")]
#       }),
#     5]

The index is:

th_idx.last # => 5

Once you have th_idx, you can easily access the parent or child nodes to learn about its surroundings:

th_node = th_idx.first
th_node['id'] # => "langmstrQty"
th_node.at('span')
# => #(Element:0x3fd5110286d8 {
#      name = "span",
#      attributes = [
#        #(Attr:0x3fd511021b6c { name = "id", value = "NumStrs" })],
#      children = [ #(Text "\nNumber of Strings")]
#      })
th_node.at('span')['id'] # => "NumStrs"

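with_index adds a 0-based index to each element passed to it. to_enum is needed because search returns a NodeSet, which is Enumerable but is not an Enumerator, and with_index is only defined on Enumerator; to_enum returns one.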

If you want a 1-based index, use with_index(1).
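To see both points in isolation, here is a minimal sketch in plain Ruby, without Nokogiri. HeaderList and find_with_index are hypothetical names made up for this illustration; HeaderList only mimics the relevant property of a NodeSet (it includes Enumerable but is not an Enumerator).

```ruby
# Hypothetical stand-in for a NodeSet: Enumerable, but not an Enumerator.
class HeaderList
  include Enumerable

  def initialize(labels)
    @labels = labels
  end

  # Enumerable only requires #each.
  def each(&block)
    @labels.each(&block)
  end
end

# Hypothetical helper: returns [label, index] for the first matching label,
# or nil if nothing matches. `start` is the offset passed to with_index.
def find_with_index(list, target, start = 0)
  # to_enum wraps #each in an Enumerator, which is what provides with_index.
  list.to_enum.with_index(start).find { |label, _idx| label == target }
end

headers = HeaderList.new(['Cabinet', 'BB1', 'BB2', 'Temperature',
                          'Strings Present', 'Number of Strings'])

puts headers.respond_to?(:with_index)                 # false
p find_with_index(headers, 'Number of Strings')       # ["Number of Strings", 5]
p find_with_index(headers, 'Number of Strings', 1)    # ["Number of Strings", 6]
```

If only the index is needed, Enumerable#find_index or each_with_index would also work directly on an Enumerable without to_enum.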

Answer 1 (score: 1)

Got it working. I'm not sure whether this is an efficient approach, but it works.

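header = table.xpath("tr[1]")
# id holds the span ID to look up, e.g. "NumStrs"; ".//" keeps the search inside the header row
value = header.xpath(".//span[@id='#{id}']").text.strip
# Collect the stripped header texts, drop the empty ones, and return a 1-based index
index = header.search('th//text()').collect {|text| text.to_s.strip}.reject(&:empty?).index(value)+1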