Replace or remove links in content

Time: 2013-11-07 16:28:35

Tags: ruby xml xpath nokogiri

I have some content containing href links that look like this:

<p>Here you can find
    <a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
    <a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
    <a href="ssNODELINK/RisksAndCauses"> and Risks </a>
    <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>

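And some more.

The result I want is to remove all the ssNODELINK links listed above and keep the other links. The result should look like this:

Here you can find Survival stats Smoking stats and Risks Something of recent research

I tried the following lines of code to achieve this: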

page_content.gsub!(/(<a href="ssNODELINK\/a-zA-Z">)/, '')

This only removes part of the content:

page_content.gsub!(/(<a href="ssNODELINK)/, '')

Any suggestions on how to achieve the result I want?

1 Answer:

Answer 0: (score: 1)

I would do it like this:

require 'nokogiri'

doc = Nokogiri.HTML <<-eot
<p>Here you can find
    <a href="ssNODELINK/SurvivalStatistics">Survival stats </a>
    <a href="ssNODELINK/SmokingStatistics">Smoking stats </a>
    <a href="ssNODELINK/RisksAndCauses"> and Risks </a>
    <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
of recent research</p>
eot

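# Collect every <a> that is a direct child of a <p>.
nodesets = doc.css('p > a')
nodesets.each do |nd|
  # Detach the link (together with its text) when its href contains ssNODELINK.
  nd.unlink if nd['href'].include? 'ssNODELINK'
end

# Print the document, dropping the blank lines left behind by the removed links.
puts doc.to_html.gsub(/^\s*\n/, "")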
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>Here you can find
# >>     <a target="_blank" href="http://www.something.ac.uk/"> Something </a>
# >> of recent research</p></body></html>