My web page content looks something like this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
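My goal is to capture the text inside #level2, but the #level3 <div> is nested at the same level as the text I want.
Is it possible to exclude that <div>? Or should I modify the document and simply remove the element before extracting the text?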
Answer 0 (score: 4)
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
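# Unlink the unwanted <div id="level3"> from the parsed document.
page.xpath("//*[@id='level3']").remove
# Read the text that remains inside <div id="level2">.
p page.xpath("//*[@id='level2']").inner_text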
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now you can clean up the output text as needed.
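As one possible way to do that cleanup, here is a minimal sketch in plain Ruby. It assumes the raw string is exactly the one shown in the output comment above; the variable names `raw`, `lines`, and `single_line` are made up for illustration.

```ruby
# Raw text as printed by inner_text in the answer above (assumed verbatim).
raw = "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "

# Split into lines, trim each one, and drop the lines that end up empty.
lines = raw.lines.map(&:strip).reject(&:empty?)
p lines
# => ["Here is some text i want", "Here is some more text i want", "Oh i want this text too :)"]

# Alternatively, collapse every run of whitespace into a single space.
single_line = raw.split.join(" ")
p single_line
# => "Here is some text i want Here is some more text i want Oh i want this text too :)"
```

Which form is better depends on whether the line breaks between the three pieces of text matter to you.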
Answer 1 (score: 4)
If your HTML fragment is in the variable html, you can do the following:
require 'nokogiri'
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You can also do this with XPath, but I find CSS selectors simpler for straightforward cases like this.