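I need to use Nokogiri with CSS or XPath selectors to match content in the following HTML. Each match should start at the <div> tag with class="propsBar" and end at the closing tag of the <div> with class="oddsInfoBottom". It should find every occurrence of this pattern: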
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-907-Yes-No-081414]" type="hidden" value="X2036-907-Yes-No-081414^No^Yes^Nationals (S Strasburg) @ Met…l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-907-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-915-Yes-No-081414]" type="hidden" value="X2036-915-Yes-No-081414^No^Yes^Astros (S Feldman) @ Red Sox …l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-915-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-917-Yes-No-081414]" type="hidden" value="X2036-917-Yes-No-081414^No^Yes^Rays (J Odorizzi) @ Rangers (…l there be a score in the 1st Inning?^8/14/2014^8:05 PM^2036" name="events[X2036-917-Yes-No-081414]"></input>
<div class="timeBar"></div>
The HTML above should return three matches.
So far, the only way I have been able to do this is:
one = html.xpath("//div[@class='propsBar']")
two = html.xpath("//div[@class='oddsInfoTop']")
three = html.xpath("//div[@class='oddsInfoBottom']")
one.zip(two, three).flatten.each_slice(3).map(&:join)
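The drawback is that this returns only text, not Nokogiri elements. I also think parsing this way is fragile: if the page has different numbers of elements matching one, two, and three, the groups will no longer line up.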
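To make the concern about mismatched counts concrete, here is a minimal sketch using plain arrays as stand-ins for the three node sets. The string values are hypothetical placeholders, not real page data. Note that zip does not raise an error in this case; it silently pairs up the wrong items and pads with nil.

```ruby
# Plain arrays stand in for the three node sets; the values are hypothetical.
one   = %w[props1 props2 props3]
two   = %w[top1 top3]                 # the page is missing "top2"
three = %w[bottom1 bottom2 bottom3]

rows = one.zip(two, three)

# The second row now pairs props2 with top3, and the third row is padded with nil.
rows.each { |row| p row }
```

No exception is raised, so the misalignment is easy to miss unless the counts are checked first.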
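Answer 0 (score: 1)
"I need to use Nokogiri with CSS selectors or XPath to match text from the following HTML. It should start matching at the tag with class="propsBar" and end matching at the end of the tag with class="oddsInfoBottom""
But the groups are all identical, for example: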
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
Okay, here you go:
require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
doc.css('div.propsBar').each do |div|
  puts div.to_html
  current_node = div
  # Walk forward through the following sibling elements
  while current_node = current_node.next_element
    puts current_node.to_html
    if current_node.has_attribute?('class')
      if current_node['class'].match(/\b oddsInfoBottom \b/xm)
        puts "-" * 10
        break # Go get a new starting tag
      end
    end
  end
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
"But this has the drawback of returning only text, not Nokogiri elements."
require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
groups = []
this_group = []
doc.css('div.propsBar').each do |tag|
  this_group << tag
  current_tag = tag
  # Collect the following sibling elements until the end tag is reached
  while current_tag = current_tag.next_element
    this_group << current_tag
    if current_tag.has_attribute?('class')
      if current_tag['class'].match(/\b oddsInfoBottom \b/xm)
        groups << this_group
        this_group = []
        break
      end
    end
  end
end
groups.each do |group|
  group.each do |tag|
    puts tag.to_html
  end
  puts '-' * 10
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
Answer 1 (score: 1)
Here is what I would write:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
EOT
found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = [node]
  # Follow the sibling chain, text nodes included, until the end tag
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
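(Note that I removed the <input> tags because they only clutter the input HTML. When you supply input data, strip out all the noise.)
Running it returns the nodes it found as an array of arrays. Each sub-array contains the individual nodes found by walking the sibling chain in order: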
require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49363c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4935b0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a49354c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4934e8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49345c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a4933f8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493394 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493308 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >> })]]
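Remember that, once parsed, the document is a linked list of nodes. If the original XML or HTML contained line breaks, there will be Text nodes containing at least a newline ("\n"). Because it is a list, we can move forward and backward using next_sibling and previous_sibling respectively. This makes it really easy to grab small chunks, even when there is no block tag wrapping the content you want.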
If you want the returned value to resemble the output of the search, css, or xpath methods, the inner variable nodes needs to change from an Array to a NodeSet:
found_nodes = doc.search('div.propsBar').map{ |node|
  # Start a NodeSet instead of an Array so the result behaves like search output
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
require 'pp'
pp found_nodes.map(&:to_html)
Which results in:
# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]
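Finally, note that I am using CSS selectors rather than XPath. I prefer them because they are usually more readable and concise. XPath is more powerful, since it was designed for navigating XML, and it can often do all the heavy lifting that we would otherwise have to do in Ruby after a CSS selector has only gotten us close to what we want. Use whichever gets the job done for you, taking into account which is easier to read and maintain.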
Answer 2 (score: 1)
Use the + (adjacent sibling) combinator:
doc.search('.propsBar').each do |props_bar|
  # '+' selects the element immediately following props_bar
  odds_info_top = props_bar.at('+ .oddsInfoTop')
  puts props_bar.text, odds_info_top.text
end