Ruby Web Scrape (Nokogiri) - Cleanup

Date: 2016-11-22 14:51:38

Tags: ruby web nokogiri

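I'm trying to learn how to scrape data from a website.

This is what I put together after a few days of research. However, Nokogiri's output is not as "clean" as I expected. When I print my array, I get a lot of newline characters ("\n") in the output.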

require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'

# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')

# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)

# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
    property_details = d.text
    details_array.push(property_details)
end

Pry.start(binding)

In Pry, if I display details_array, the output looks like this:

[2] pry(main)> details_array
=> ["\n      \n        \n          \n                2265 Tanglewood Cir NE,\n            Atlanta,\n            GA\n            30345\n \n        \n\n        \n          Dresden East\n        \n        \n\n            $289,900\n          \n          \n            \n        3 bd\n                2 ba\n                1,566 sq ft\n             
0.3 acres lot\n            \n          \n        \n          \n            Single Family Home\n          \n        \n          \n            \n  
Brokered by Re/Max Town And Country\n            \n          \n       
\n        \n          \n            Brokered by \n            Re/Max
Town And Country\n          \n        \n      \n    ",  "\n      \n   
\n          \n                2141 Dunwoody Gln,\n           
Atlanta,\n            GA\n            30338\n          \n        \n\n 
\n          \n            $469,900\n          \n          \n          
\n                4 bd\n                3 ba\n                2,850 sq
ft\n                0.3 acres lot\n                2 car\n           
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Buckhead Home Realty Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Buckhead Home
Realty Llc\n          \n        \n      \n    ",  "\n      \n       
\n          \n                1048 Martin St SE,\n           
Atlanta,\n            GA\n            30315\n          \n        \n\n 
\n          Intown South\n          Peoplestown\n        \n        \n 
\n            $164,900\n          \n          \n            \n        
5 bd\n                3 ba\n                2,376 sq ft\n             
7,405 sq ft lot\n            \n          \n        \n          \n     
Single Family Home\n          \n        \n          \n            \n  
Brokered by Greenlet Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Greenlet Llc\n    
\n        \n      \n    ",  "\n      \n        \n          \n         
1048 Martin St SE,\n            Atlanta,\n            GA\n           
30315\n          \n        \n\n        \n          Intown South\n     
Peoplestown\n        \n        \n          \n            $164,900\n   
\n          \n            \n                5 bd\n                3
ba\n                2,055 sq ft\n                7,584 sq ft lot\n    
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Greenlet, Llc\n            \n          \n        \n        \n         
\n            Brokered by \n            Greenlet, Llc\n          \n   
\n      \n    ",  "\n      \n        \n          \n               
1991 Woodbine Ter NE,\n            Atlanta,\n            GA\n         
30329\n          \n        \n\n        \n          Sagamore Hills\n   
\n        \n          \n            $299,900\n          \n          \n
\n                3 bd\n                1+ ba\n                1,449
sq ft\n                0.8 acres lot\n            \n          \n      
\n          \n            Single Family Home\n          \n        \n  
\n           :

1 Answer:

Answer 0 (score: 0)

It looks like your selector isn't drilling down far enough into the document. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div>
      <p>foo</p>
      <p>bar</p>
    </div>
  </body>
</html>
EOT

doc.search('div').map(&:text) # => ["\n      foo\n      bar\n    "]

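When you look at the text of a parent tag, you get the text nodes used to format the HTML as well as the text of the <p> nodes you actually want.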

If you drill down to the actual nodes you want and then take their text, the formatting between the tags is dropped:

doc.search('div p').map(&:text) # => ["foo", "bar"]
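If you cannot target the inner nodes, or you have already collected strings like the ones in details_array, a fallback is to normalize the whitespace in plain Ruby. This is a different technique from the selector-based fix above, shown only as a minimal sketch. The raw string is a shortened, hand-copied stand-in for one entry of the question's output, and one_line and fields are illustrative names.

```ruby
# Shortened stand-in for one element of details_array (an assumption, not live data).
raw = "\n      \n        \n          2265 Tanglewood Cir NE,\n            Atlanta,\n" \
      "            GA\n            30345\n        \n\n            $289,900\n          \n" \
      "        3 bd\n                2 ba\n                1,566 sq ft\n    "

# Option 1: collapse every run of whitespace into a single space.
one_line = raw.split.join(' ')

# Option 2: keep one entry per non-blank line, with surrounding whitespace removed.
fields = raw.lines.map(&:strip).reject(&:empty?)

puts one_line
p fields
```

Option 2 keeps multi-word values such as "1,566 sq ft" together because it splits on lines rather than on every space, which may be more convenient if the goal is to write the fields to CSV.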

See also "How to avoid joining all text from Nodes when scraping".