Question

I have the following block of HTML:

<tr>
   <th>Consignment Service Code</th>
   <td>ND16</td>
</tr>

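What I ultimately want to extract is the string ND16, but to do that I need to select the <tr> based on the text Consignment Service Code.

I'm already using Nokogiri to parse the HTML, so it would be great to keep using it.

So, how can I select that block of HTML based on the text "Consignment Service Code"?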

Answer 1

You can do it like this:

require 'nokogiri'

doc=Nokogiri::HTML::parse <<-eot
<tr>
   <th>Consignment Service Code</th>
   <td>ND16</td>
</tr>
eot

node = doc.at_xpath("//*[text()='Consignment Service Code']/following-sibling::*[1]")
puts node.text
# >> ND16

Here is an additional attempt that might help you get started:

## parent node
parent_node = doc.at_xpath("//*[text()='Consignment Service Code']/..")
puts parent_node.name # => tr

## to get the child td (use a relative path; "//td" would search the whole document)
puts parent_node.at_xpath("./td").text # => ND16

puts parent_node.to_html

#<tr>
#<th>Consignment Service Code</th>
#   <td>ND16</td>
#</tr>

Answer 2

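Here is another way.

Use Nokogiri's css method to find the tr nodes, then select the ones whose th tag contains the desired text. Finally, take the selected nodes and extract the td value: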

require 'nokogiri'

str = '<tr>
   <th>Consignment</th>
   <td>ND15</td>
</tr>
<tr>
   <th>Consignment Service Code</th>
   <td>ND16</td>
</tr>
<tr>
   <th>Consignment Service Code</th>
   <td>ND17</td>
</tr>'

doc = Nokogiri::HTML.parse(str)
nodes = doc.css('tr')
           .select{|el| 
             el.css('th').text =~ /^Consignment Service Code$/
           }

nodes.each do |el|
  p el.css('td').text
end

The output is:

"ND16"
"ND17"
