为什么Nokogiri会给我多个结果?

时间:2013-09-26 00:02:21

标签: ruby-on-rails ruby ruby-on-rails-3 nokogiri

我正在尝试使用Nokogiri解析HTML字符串,但我遇到了一些递归问题,我无法弄清楚原因。

鉴于这些命令:

string = <h3>Lancers were arranged.&nbsp;</h3>
         <div>Gabriel found himself partnered with Miss Ivors.</div>
         <br>She leaned. He lit a <b>candle</b>.
         They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.<br>

body = Nokogiri::HTML(string)
result = []
body.traverse { |node| result << node }

我期待上面元素的数组。相反,我得到了这个:

[#<Nokogiri::XML::DTD:0x3fde1f3d5274 name="html">
#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">
#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence
their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">
#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence
their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>
#<Nokogiri::XML::Element:0x3fde1ea575e4 name="html" children=[#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence
their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>]>
#<Nokogiri::HTML::Document:0x3fde1f3d6084 name="document" children=[#<Nokogiri::XML::DTD:0x3fde1f3d5274 name="html">
#<Nokogiri::XML::Element:0x3fde1ea575e4 name="html" children=[#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence
their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>]>]>] 

抱歉这个长度。任何人都可以帮我弄清楚为什么会这样吗?和/或如何预防?

2 个答案:

答案 0 :(得分:5)

这是因为traverse以递归方式调用提供的块本身及其所有子节点。因此,它将html字符串的每个节点添加到result数组,而不仅仅是顶级节点。您看到的“多重结果”是Nokogiri节点定义inspect的结果。例如,返回数组中的第3个元素表示h3节点,但也会打印其所有子节点,其中包含作为数组第2个元素的text节点。

如果您希望result包含对文档中每个节点的引用,那么这是正确的方法。如果您只想要顶级节点,请使用children

答案 1 :(得分:2)

当您解析不完整的html时,Nokogiri会自动添加doctype和html以及body元素。你必须像这样解析它以避免这种行为:

body = Nokogiri::HTML::DocumentFragment.parse(your_html)

如果你想要结果和除文本节点之外的元素数组,你可以这样做:

result = body.xpath('./*')

然后结果(为清晰起见转换为字符串)将是:

["<h3>Lancers were arranged. </h3>",
 "<div>Gabriel found himself partnered with Miss Ivors.</div>",
 "<br>",
 "<b>candle</b>",
 "<br>"]