How do I match consecutive nodes with Nokogiri?

Asked: 2014-08-14 20:36:21

Tags: html css ruby xpath nokogiri

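I need to match the text in the following HTML using Nokogiri and CSS or XPath selectors. The match should start at the <div> tag with class="propsBar" and end at the closing side of the <div> tag with class="oddsInfoBottom". It should identify every occurrence of this pattern: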

<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-907-Yes-No-081414]" type="hidden" value="X2036-907-Yes-No-081414^No^Yes^Nationals (S Strasburg) @ Met…l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-907-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-915-Yes-No-081414]" type="hidden" value="X2036-915-Yes-No-081414^No^Yes^Astros (S Feldman) @ Red Sox …l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-915-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-917-Yes-No-081414]" type="hidden" value="X2036-917-Yes-No-081414^No^Yes^Rays (J Odorizzi) @ Rangers (…l there be a score in the 1st Inning?^8/14/2014^8:05 PM^2036" name="events[X2036-917-Yes-No-081414]"></input>
<div class="timeBar"></div>

The HTML above should return three matches.

So far, the only way I have been able to do this is:

one = html.xpath("//div[@class='propsBar']")
two = html.xpath("//div[@class='oddsInfoTop']")
three = html.xpath("//div[@class='oddsInfoBottom']")

one.zip(two, three).flatten.each_slice(3).map(&:join)

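This has the drawback of returning only text rather than Nokogiri elements. I also think parsing this way is dangerous: it will break if the page has different numbers of elements matching one, two, and three.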
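The mismatch risk can be shown without Nokogiri at all. This is a minimal sketch in which plain strings stand in for the nodes returned by the three xpath calls; the names and the missing element are made up for illustration:

```ruby
# Plain strings stand in for the three node lists.
# Assumption for illustration: the page is missing one oddsInfoTop element.
one   = %w[props1 props2 props3]
two   = %w[top1 top3]                  # "top2" is absent
three = %w[bottom1 bottom2 bottom3]

# zip pairs items purely by position, so everything after the gap shifts
# and the shortfall is padded with nil.
groups = one.zip(two, three)

groups.each { |group| p group }
# ["props1", "top1", "bottom1"]
# ["props2", "top3", "bottom2"]
# ["props3", nil, "bottom3"]
```

No error is raised; the second group silently pairs props2 with top3. Note also that zip already returns one three-element array per group, so for flat lists like these the flatten.each_slice(3) step rebuilds the same triples.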

3 answers:

Answer 0: (score: 1)

  

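  I need to match the text in the following HTML using Nokogiri and CSS or XPath selectors. The match should start at the <div> tag with class="propsBar" and end at the closing side of the <div> tag with class="oddsInfoBottom".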

But they are all the same, for example:

<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>

OK, here you go:

require 'nokogiri'

doc = Nokogiri::HTML(File.read("xml3.xml"))

doc.css('div.propsBar').each do |div|
  puts div.to_html
  current_node = div

  while current_node = current_node.next_element
    puts current_node.to_html

    # Stop once the closing div of the group is reached.
    # to_s guards against elements that have no class attribute.
    if current_node['class'].to_s.split.include?('oddsInfoBottom')
      puts "-" * 10
      break  # Go get a new starting tag
    end
  end
end

--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
  

  This has the drawback of returning only text rather than Nokogiri elements.

require 'nokogiri'

doc = Nokogiri::HTML(File.read("xml3.xml"))

groups = []

doc.css('div.propsBar').each do |tag|
  # Start a fresh group for every propsBar so that an unterminated group
  # cannot leak into the next one.
  this_group = [tag]
  current_tag = tag

  while current_tag = current_tag.next_element
    this_group << current_tag

    # Close the group once the final div is reached.
    # to_s guards against elements that have no class attribute.
    if current_tag['class'].to_s.split.include?('oddsInfoBottom')
      groups << this_group
      this_group = []
      break
    end
  end

end


groups.each do |group|
  group.each do |tag|
    puts tag.to_html
  end
  puts '-' * 10
end

--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------

Answer 1: (score: 1)

I'd write it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
EOT

found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = [node]
  loop do
    node = node.next_sibling
    break if node.nil?  # ran off the end of the sibling chain
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}

(Note that I removed the <input> tags because they only clutter the input HTML. When you supply input data, please strip out all the noise.)

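Running it returns the nodes it found as an array of arrays. Each sub-array contains the individual nodes found by walking the sibling chain in order: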

require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a49363c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a4935b0 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >>     })],
# >>  [#(Element:0x3ff00a49354c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a4934e8 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a49345c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >>     })],
# >>  [#(Element:0x3ff00a4933f8 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a493394 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a493308 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >>     })]]

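Remember that, once parsed, the sibling nodes in a document are linked together like a list. If the original XML or HTML contained line breaks, there will be Text nodes containing at least one newline ("\n"). Because it is a list, we can move forward and backward with next_sibling and previous_sibling respectively. That makes it really easy to grab small chunks, even when no block tag wraps the content you want.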

If you want the returned values to resemble the output of the search, css, or xpath methods, the inner variable nodes needs to change from an Array to a NodeSet:

found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  loop do
    node = node.next_sibling
    break if node.nil?  # ran off the end of the sibling chain
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}

require 'pp'
pp found_nodes.map(&:to_html)

Which outputs:

# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >>  "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >>  "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]

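Finally, notice that I'm using CSS selectors instead of XPath. I prefer them because they are usually more readable and concise. XPath is more powerful because it was made for parsing XML, and it can often do all the heavy lifting that we would otherwise have to do in Ruby after CSS selectors only get us close to what we want. Use whichever gets the job done for you, considering which is easier to read and maintain.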

Answer 2: (score: 1)

Use the + (adjacent sibling) combinator:

doc.search('.propsBar').each do |props_bar|
  odds_info_top = props_bar.at('+ .oddsInfoTop')
  puts props_bar.text, odds_info_top.text
end