Converting a Rails task to Rake

Time: 2014-05-20 04:42:40

Tags: ruby-on-rails ruby

I currently have this file in my models/ folder:

class Show < ActiveRecord::Base
  require 'nokogiri'
  require 'open-uri'

  has_many :user_shows
  has_many :users, through: :user_shows

  def self.update_all_screenings
    Show.all.each do |show|
        show.update_attribute(:next_screening, Show.update_next_screening(show.url))
    end
  end

  def self.update_next_screening(url)
    nextep = Nokogiri::HTML(open(url))
    ## Finds the title of the show and extracts the date of the show and converts to string ##
    begin

        title = nextep.at_css('h1').text
        date = nextep.at_css('.next_episode .highlight_date').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
        date = date.to_s

    ## Because if it airs today it won't have a date rather a time this checks whether or not 
    ## there is a date. If there is it will remain, if not it will insert todays date
    ## plus get the time that the show is airing    
        if date =~ /\d{1,2}\/\d{1,2}\/\d{4}/
            showtime = DateTime.strptime(date, "%m/%d/%Y")
        else
            date = DateTime.now.strftime("%D")
            time = nextep.at_css('.next_episode .highlight_date').text[/\dPM|\dAM/]
            time = time.to_s
            showtime = date + " " + time
            showtime = DateTime.strptime(showtime, "%m/%d/%y %l%p")

        end

        return showtime

    rescue
        return nil
    end
  end
end

However, when I run

Show.update_all_screenings

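it takes a very long time to finish. I have a very similar script, written as a rake file, that has to do twice as much scraping and manages to finish in about 10 minutes, whereas this one would take 8 hours. So I would like to know how to convert this file into a rake task. The whole app I am building depends on this finishing in 1 hour at most.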

Here is the other script, for reference:

require 'mechanize'

namespace :show  do

  desc "add tv shows from web into database"
  task :scrape => :environment do
    puts 'scraping...'

    Show.delete_all

    agent = Mechanize.new
    agent.get 'http://www.tv.com/shows/sort/a_z/'
    agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
      agent.get letter_link[:href]
      letter = letter_link.text.upcase
      agent.page.search('//li[@class="show"]/a').map do |show_link|
        Show.create(title: show_link.text, url:'http://tv.com' + show_link[:href].to_s + 'episodes/')
      end
      while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
        agent.get next_page_link[:href]
        agent.page.search('//li[@class="show"]/a').map do |show_link|
          Show.create(title: show_link.text, url:'http://tv.com' + show_link[:href].to_s + 'episodes/')
        end
      end
    end

  end
end

1 Answer:

Answer 0: (score: 2)

Rake is not a magic bullet - it will not run your code any faster.
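For completeness, wrapping the model method in a rake task is simple, although it should not change the run time on its own. Below is a minimal sketch. The task name show:update_screenings is hypothetical, and the Show class and the :environment task are stand-ins added only so the snippet runs outside a Rails app; in a real app both come from Rails and the file would live under lib/tasks/.

```ruby
require 'rake'

# Stand-in for the Rails model so this sketch runs outside an app (assumption).
# In a real app, drop this class: the :environment task loads the real model.
class Show
  @runs = 0

  class << self
    attr_reader :runs

    # Placeholder body; the real method scrapes and updates every show.
    def update_all_screenings
      @runs += 1
    end
  end
end

# Stand-in for the :environment task that Rails normally defines (assumption).
task :environment

namespace :show do
  desc 'update next_screening for every show'
  # The task name update_screenings is hypothetical.
  task update_screenings: :environment do
    Show.update_all_screenings
  end
end

# From a shell this would be `rake show:update_screenings`;
# here the task is invoked directly so the sketch is self-contained.
Rake::Task['show:update_screenings'].invoke
```

The task body is a single call into the model, so whatever makes the model method slow stays slow when it is run through rake.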

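What you can do is run your code more efficiently. The main time consumer in your code is the repeated call to open(url), which fetches one page at a time and waits for each response before starting the next. If you can fetch all of the URLs concurrently, the whole process should take a fraction of the time it takes now.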
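To illustrate why concurrent fetching helps, here is a small sketch using only the Ruby standard library. It is not the asker's scraper: fake_fetch is a hypothetical stand-in for open(url) that sleeps to mimic network latency, and the URLs are made up.

```ruby
require 'benchmark'

# Hypothetical stand-in for open(url): sleeps to mimic network latency.
def fake_fetch(url)
  sleep 0.1
  "<html>#{url}</html>"
end

# One request at a time, as in the original update_all_screenings loop.
def fetch_sequentially(urls)
  urls.map { |url| fake_fetch(url) }
end

# One thread per URL; Thread#value waits for the thread and returns its result,
# so the output keeps the same order as the input.
def fetch_concurrently(urls)
  urls.map { |url| Thread.new { fake_fetch(url) } }.map(&:value)
end

urls = (1..5).map { |i| "http://example.com/show/#{i}" }

sequential_time = Benchmark.realtime { fetch_sequentially(urls) }
concurrent_time = Benchmark.realtime { fetch_concurrently(urls) }

puts format('sequential: %.2fs, concurrent: %.2fs', sequential_time, concurrent_time)
```

The sequential version waits for each simulated request in turn, while the threaded one waits for all of them at once. For thousands of shows you would not want one thread per URL; some cap on the number of requests in flight is the usual approach.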

You can use the typhoeus gem (or some other gem) to handle this for you.

- Danger! Untested code! -

I have no experience with this gem, but your code would look something like this:

require 'nokogiri'
require 'open-uri'
require 'typhoeus'

class Show < ActiveRecord::Base


  has_many :user_shows
  has_many :users, through: :user_shows

  def self.update_all_screenings
    hydra = Typhoeus::Hydra.hydra
    Show.all.each do |show|
      request = Typhoeus::Request.new(show.url, followlocation: true)
      request.on_complete do |response|
        show.update_attribute(:next_screening, Show.update_next_screening(response.body))
      end
      hydra.queue(request)
    end
    hydra.run
  end

  def self.update_next_screening(body)
    nextep = Nokogiri::HTML(body)
    ## Finds the title of the show and extracts the date of the show and converts to string ##
    begin

        title = nextep.at_css('h1').text
        date = nextep.at_css('.next_episode .highlight_date').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
        date = date.to_s

    ## Because if it airs today it won't have a date rather a time this checks whether or not 
    ## there is a date. If there is it will remain, if not it will insert todays date
    ## plus get the time that the show is airing    
        if date =~ /\d{1,2}\/\d{1,2}\/\d{4}/
            showtime = DateTime.strptime(date, "%m/%d/%Y")
        else
            date = DateTime.now.strftime("%D")
            time = nextep.at_css('.next_episode .highlight_date').text[/\dPM|\dAM/]
            time = time.to_s
            showtime = date + " " + time
            showtime = DateTime.strptime(showtime, "%m/%d/%y %l%p")

        end

        return showtime

    rescue
        return nil
    end
  end
end

The above should collect all of the requests in one queue and run them concurrently, acting on each response as it comes in.