Question

I'm trying to use Nokogiri to extract text from between two unique sets of tags.

What is the best way to get the text in the p tag between <h2 class="point">The problem</h2> and <h2 class="point">The solution</h2>, and then all of the HTML between <h2 class="point">The solution</h2> and <div class="frame box sketh">?

A sample of the full HTML:

<h2 class="point">The problem</h2>
<p>TEXT I WANT </p>
<h2 class="point">The solution</h2>
HTML I WANT with it's own set of tags (but never an <h2> or <div>)
<div class="frame box sketh"><img src="URL for Image I want later" alt="" /></div>

Thanks!

Answer 1

require 'nokogiri'

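# DATA reads the HTML that follows __END__ at the bottom of this file.
doc = Nokogiri.HTML(DATA)
# Take every following sibling of an h2, skipping h2 elements, div elements, and bare text nodes.
doc.search('//h2/following-sibling::node()[name() != "h2" and name() != "div" and text() != "\n"]').each do |block|
  p block.text
end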

__END__
<h2 class="point">The problem</h2>
<p>TEXT I WANT</p>
<h2 class="point">The solution</h2>
<div>dont capture this</div>
<span>HTML I WANT with it's <p>own set <b>of</b> tags</p></span>
<div class="frame box sketh"><img src="URL for Image I want later" alt="" /></div>

Output:

"TEXT I WANT"
"HTML I WANT with it's own set of tags"

This XPath selects all following siblings of an h2 that are not an h2, a div, or a node containing only the string "\n".

Answer 2

Here is how to get the text of the p tags between the two h2 elements with class "point":

//h2[@class="point"][1]/following-sibling::p[./following-sibling::h2[@class="point"]]/text()

For the second part, explore w3schools and work it out using the first expression as an example.
