Question

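I wrote a short Ruby script using Nokogiri to extract some data from a web page. The script works, but it currently returns several nested tags as a single Nokogiri::XML::Element.

The script is as follows: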

require 'rubygems'
require 'nokogiri'

#some dummy content that mimics the structure of the web page
dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)

# grab the second div inside the div whose id is div_saadi
result = page.css('div#div_saadi div')[1]

puts result
puts result.class

The output is as follows:

<div style="padding:10px 0">
<span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span
</div>
Nokogiri::XML::Element

What I would like to do is produce the following output (using something like .each):

content
content outside of the span
morecontent
morecontent outside of the span

Answer 1

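You're close, but you're not understanding what is being returned to you.

Depending on the HTML markup, a tag can contain embedded tags. That is what is happening here: you asked for a single node, but that node contains other nodes: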

puts page.css('div#div_saadi div')[1].to_html
# >> <div style="padding:10px 0">
# >> <span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div>

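text works on both a NodeSet and a Node. It grabs all the text beneath whatever you point it at and returns it as one string, no matter how many levels it has to descend to do so: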

result = page.css('div#div_saadi div')[1].text
# => "contentcontent outside of the spanmorecontentmorecontent outside of the span"

Instead, you have to iterate over the individual embedded nodes and extract the text of each one:

require 'nokogiri'

dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)

result = page.css('div#div_saadi div')[1]
puts result.children.map(&:text)

# >> content
# >> content outside of the span
# >> morecontent
# >> morecontent outside of the span

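children returns all the embedded nodes as a NodeSet. Iterating over it yields individual Nodes, and calling text on each Node at that point returns what you want.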
