Stripping all HTML tags from a page with Nokogiri

Asked: 2014-05-05 14:04:05

Tags: html css ruby html-parsing nokogiri

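I have searched everywhere, but all I can find is how to do CSS selection with Nokogiri. What I am after is simply getting rid of all the HTML tags.

For example: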

<html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
      <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
      <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
      <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
      <p>Here are some entertaining links:</p>
      <ul>
         <li><a href="http://youtube.com">YouTube</a></li>
         <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
         <li><a href="http://kathack.com/">Kathack</a></li>
         <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
      </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
<p>Addition</p>
</html> 
Extra content

should output as:

Hello Webpage!

Click here to go to the search engine Google

Or you can click here to go to Microsoft Bing.

Don't want to learn Ruby? Then give Zed Shaw's Learn Python the Hard Way a try

Here are some entertaining links:

YouTube
Reddit
Kathack
New York Times
Thank you for reading my webpage!
Addition
Extra content

How can I do this with Nokogiri? Is there anything else I could use for this, such as JavaScript?

2 answers:

Answer 0: (score: 1)

require 'nokogiri'

html = %q{ 
  <html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
     <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
     <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
     <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's    Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
    <p>Here are some entertaining links:</p>
    <ul>
     <li><a href="http://youtube.com">YouTube</a></li>
     <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
     <li><a href="http://kathack.com/">Kathack</a></li>
     <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
     </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
</html>
}

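# Parse with the HTML parser rather than the XML one, since the input is HTML.
doc = Nokogiri::HTML(html)
body = doc.at('body')
# text already returns the content without markup, so no tag-stripping regex is needed.
puts body.text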

Answer 1: (score: 0)

There are many ways to do what you want. I would consider using Loofah, which wraps Nokogiri.

With Loofah you would do something like:

require 'loofah'

document = Loofah.fragment(html)
document.scrub!(:prune).text

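The :prune scrubber removes all unsafe tags together with their subtrees, and text then returns the remaining text content. If you want newline characters around block-level elements, Loofah's to_text method handles that whitespace.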