I have some HTML content containing links that look like this:
<p>Here you can find
<a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
<a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
<a href="ssNODELINK/RisksAndCauses"> and Risks </a>
<a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
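and a few more like these.
The result I want is to remove all of the ssNODELINK links listed above
and keep the other links. The result should look like this: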
Here you can find Survival stats Smoking stats and Risks Something of recent research
I tried the following lines of code to achieve this:
page_content.gsub!(/(<a href="ssNODELINK/a-zA-Z">)/, ''))
and the following, which only removes part of each link:
page_content.gsub!(/(<a href="ssNODELINK)/, ''))
Any suggestions on how to get the result I want?
Answer 0 (score: 1)
I would do it like this:
require 'nokogiri'
doc = Nokogiri.HTML <<-eot
<p>Here you can find
<a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
<a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
<a href="ssNODELINK/RisksAndCauses"> and Risks </a>
<a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
eot
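# Select the anchors that are direct children of a paragraph.
nodesets = doc.css('p > a')
nodesets.each do |nd|
  # Remove the whole node (tag and text) when its href contains ssNODELINK.
  # to_s guards against anchors that have no href attribute.
  nd.unlink if nd['href'].to_s.include?('ssNODELINK')
end
# Print the document, dropping the blank lines left behind by the removed nodes.
puts doc.to_html.gsub(/^\s*\n/, "")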
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Here you can find
# >> <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
# >> of recent research</p></body></html>
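Note that unlink removes the link text along with the tag, whereas the desired output in the question appears to keep the text. If keeping the text is the goal, the gsub attempt from the question can also be repaired. The two patterns there fail because a-zA-Z is matched literally rather than as a character class, and because neither pattern covers the link text or the closing </a>. What follows is only a minimal sketch in plain Ruby: strip_ssnodelinks is a hypothetical helper name, and the pattern assumes double-quoted href attributes and no nested anchors. A regex is fragile for arbitrary HTML, so a parser remains the safer choice for anything less regular.

```ruby
# Hypothetical helper (not from the original post): removes the <a> wrapper
# from links whose href starts with ssNODELINK/, keeping the link text.
def strip_ssnodelinks(html)
  # [^>]* stays inside a single tag, so other links are never swallowed;
  # (.*?) lazily captures the link text up to the nearest </a>.
  pattern = %r{<a\b[^>]*\bhref="ssNODELINK/[^"]*"[^>]*>(.*?)</a>}m
  html.gsub(pattern) { Regexp.last_match(1) }
end

page_content = <<~HTML
  <p>Here you can find
  <a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
  <a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
  <a href="ssNODELINK/RisksAndCauses"> and Risks </a>
  <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
  of recent research</p>
HTML

puts strip_ssnodelinks(page_content)
```

The block form of gsub is used so the captured text is substituted directly, and the external link is left untouched because its href does not start with ssNODELINK/.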