I am using Nokogiri to scrape a website that looks like this:
<div class="BOX">
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
</div>
<div class="BOX">
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
</div>
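I want to scrape everything inside each "BOX" div. Every "BOX" contains its own unique divs and HTML tags, with no obvious pattern. How can I do this?
My first attempt looked like this: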
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.examplesite.com'))
doc.css('BOX').each do |box|
puts box.content
end
But it returns nothing. Can someone explain what is going on?
Answer 0 (score: 4)
You are missing a dot (.).
Without the dot, the selector matches a <BOX> tag. To match elements with class="BOX", put a dot in front of the class name.
doc.css('.BOX').each do |box|
# ^-- here
puts box.content
end
Answer 1 (score: 3)
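I think you should use the #inner_html method instead of #content if you want the markup as well as the text. Your CSS class selector is also wrong: it needs a leading dot. The code should look like this: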
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eot
<div class="BOX">
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
</div>
<div class="BOX">
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
</div>
eot
doc.css('.BOX').each do |n|
  # puts prints the markup itself; p would print the escaped inspect form
  puts n.inner_html
end
Output:
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
#content, by contrast, gives you only the text inside each div node, with the HTML wrappers stripped. See below:
doc.css('.BOX').each do|n|
puts n.content
end
Output:
This is an apple.
Apple a day, doctor away
This is an iPhone.
This is an Android.
Apple home page
Snoop Lion has both. He's rich.