I'm having a problem with Nokogiri. Suppose I have this HTML:
<html>
<p>
This is just an example, how to remove the next sentence using nokogiri in Ruby.
Thank you for your help.
<strong> XXXX </strong>
<br/>
<br />
I want to remove all the HTML after the strong XXXX
<br />
<br />
<strong> YYY </strong>
</p>
How can I get "This is just an example, how to remove the next sentence using nokogiri ... Thank you for your help."? I don't want anything from <strong> XXXX onward to be included in the result.
Answer 0 (score: 1)
To exclude that content explicitly, you may want to try:
doc.search('//p/text()[not(preceding-sibling::strong)]').text
This selects every text node directly under p that does not come after a strong sibling.
With your input, it extracts the following:
This is just an example, how to remove the next sentence using nokogiri in Ruby.
Thank you for your help.
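Answer 1 (score: 0)
If you just want to get the text (which is what I think you're asking), you can call the text method on the Nokogiri element. That returns something like "... Thank you for your help. XXXX I want to remove all the HTML after the strong XXXX YYY". Here is a link to the Nokogiri documentation, in case it's useful; it covers the text method. Or are you asking how to avoid getting any text/HTML that comes after the tag?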
Answer 2 (score: 0)
Hopefully this is what you're looking for:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>
This is just an example, how to remove the next sentence using nokogiri in Ruby.
Thank you for your help.
<strong> XXXX </strong>
<br/>
<br />
I want to remove all the HTML after the strong XXXX
<br />
<br />
<strong> YYY </strong>
</p>
_HTML_
puts doc.at('//p/text()[1]').to_s.strip
# >> This is just an example, how to remove the next sentence using nokogiri in Ruby.
# >> Thank you for your help.
Now, if you want to remove the unwanted content from the source HTML itself, you can try the following:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>
This is just an example, how to remove the next sentence using nokogiri in Ruby.
Thank you for your help.
<strong> XXXX </strong>
<br/>
<br />
I want to remove all the HTML after the strong XXXX
<br />
<br />
<strong> YYY </strong>
</p>
_HTML_
doc.xpath('//p/* | //p/text()').count # => 10
ndst = doc.search('//p/* | //p/text()')[1..-1]
ndst.remove
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>
# >> This is just an example, how to remove the next sentence using nokogiri in Ruby.
# >> Thank you for your help.
# >> </p></body></html>