Question

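I have the Nokogiri scraper below set up. Everything works, but it is really slow. I have been researching how to make it faster and have come across using threads, moving it to a background process, and saving to the DB and caching it. I'm just not sure which route to take or where to start. Any advice or guidance would be greatly appreciated. You can see the live version at http://clstorycloud.com. The scraper grabs blog images and post links from different blogs on the same domain, and it currently does so live on each request.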

Model

class Photocloud < ActiveRecord::Base
  attr_reader :start_urls
  attr_accessor :images, :paths

  def initialize(start_urls)
    @start_urls = start_urls
    @paths = []
    @images = []
  end

  def scrape
    start_urls.each do |start_url|
      blog = Nokogiri::HTML(open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end


  private
  def scrape_images(blog)
    images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    images.each do |image|
      @images << image
    end
  end

  def scrape_paths(blog)      
    story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_path.each do |path|
      @paths << path
    end
  end
end

View

<div id="container" class="container">
  <% @paths.zip(@images).each do |url, img|%>
  <div class="item tranz ">
    <a href="<%= url %>" target="_blank"><img src="http://www.cltampa.com<%= img %>"></a>
  </div>
  <% end %>
  </div>
</div>

Controller

def index
  start_urls = %w[http://cltampa.com/blogs/potlikker 
    http://cltampa.com/blogs/artbreaker 
    http://cltampa.com/blogs/politicalanimals 
    http://cltampa.com/blogs/earbuds 
    http://cltampa.com/blogs/dailyloaf]
  scraper = Photocloud.new(start_urls)
  scraper.scrape
  @images = scraper.images
  @paths = scraper.paths
end

Answer 1

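I suggest using Typhoeus to make parallel requests.

Right now your scrape method waits for each request to finish before starting the next one. You can speed it up by making the requests in parallel (sending them all at once instead of one after another).

With Typhoeus your code would look like this: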

hydra = Typhoeus::Hydra.hydra

# Build a request object for each URL, add it to the queue,
# and keep a reference so the response can be read afterwards.
requests = start_urls.map do |start_url|
  request = Typhoeus::Request.new(start_url)
  hydra.queue(request)
  request
end

# Then run all the queued requests in parallel.
hydra.run

# Then you can get each request's response like this.
responses = requests.map do |request|
  # Any other code here
  request.response.body
end

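With this approach you can speed up your scraper. Suppose you have 10 requests to process and each one takes 10 seconds. With the current approach, the total processing time is 100 seconds. By sending all of the requests in parallel, the total drops to roughly the time of the slowest request, about 10 seconds.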
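The timing argument above can be checked with a small sketch that needs no gems and no network. Note that it uses plain Ruby threads rather than Typhoeus, and that `fake_fetch`, the example.com URLs, and the 0.05 second latency are hypothetical stand-ins for real requests. It only illustrates why overlapping the waits helps; it is not a replacement for the Typhoeus code.

```ruby
require 'benchmark'

# Hypothetical stand-in for an HTTP request: sleeping simulates network latency.
def fake_fetch(url, latency)
  sleep(latency)
  "body of #{url}"
end

# One request after another: total time is roughly the sum of all latencies.
def fetch_sequentially(urls, latency)
  urls.map { |url| fake_fetch(url, latency) }
end

# One thread per request: the waits overlap, so total time is roughly
# the latency of the slowest single request.
def fetch_in_parallel(urls, latency)
  threads = urls.map { |url| Thread.new { fake_fetch(url, latency) } }
  threads.map(&:value) # value joins each thread and returns its result
end

# Made-up URLs and latency, purely for illustration.
urls = (1..10).map { |i| "http://example.com/blog/#{i}" }
latency = 0.05

sequential_time = Benchmark.realtime { fetch_sequentially(urls, latency) }
parallel_time   = Benchmark.realtime { fetch_in_parallel(urls, latency) }

puts format('sequential: %.2fs, parallel: %.2fs', sequential_time, parallel_time)
```

Both helpers return the bodies in the same order as `urls`, so the parallel version can be swapped in without changing how the results are consumed.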

The Typhoeus documentation covers all of this, including a section on parallel requests.
