I currently have this file in my models/ folder:
class Show < ActiveRecord::Base
  require 'nokogiri'
  require 'open-uri'

  has_many :user_shows
  has_many :users, through: :user_shows

  def self.update_all_screenings
    Show.all.each do |show|
      show.update_attribute(:next_screening, Show.update_next_screening(show.url))
    end
  end

  def self.update_next_screening(url)
    nextep = Nokogiri::HTML(open(url))
    ## Finds the title of the show and extracts the date of the show and converts to string ##
    begin
      title = nextep.at_css('h1').text
      date = nextep.at_css('.next_episode .highlight_date').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
      date = date.to_s
      ## Because if it airs today it won't have a date rather a time this checks whether or not
      ## there is a date. If there is it will remain, if not it will insert todays date
      ## plus get the time that the show is airing
      if date =~ /\d{1,2}\/\d{1,2}\/\d{4}/
        showtime = DateTime.strptime(date, "%m/%d/%Y")
      else
        date = DateTime.now.strftime("%D")
        time = nextep.at_css('.next_episode .highlight_date').text[/\dPM|\dAM/]
        time = time.to_s
        showtime = date + " " + time
        showtime = DateTime.strptime(showtime, "%m/%d/%y %l%p")
      end
      return showtime
    rescue
      return nil
    end
  end
end
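However, when I run Show.update_all_screenings it takes a very long time to finish. I have a very similar script, a rake file, that has to do twice as much scraping and manages to finish in about 10 minutes, whereas this one would take 8 hours. So I'm wondering how I can convert this file into a rake task? The whole app I'm building depends on this being able to finish in at most 1 hour.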
Here is the other script for reference:
require 'mechanize'

namespace :show do
  desc "add tv shows from web into database"
  task :scrape => :environment do
    puts 'scraping...'
    Show.delete_all
    agent = Mechanize.new
    agent.get 'http://www.tv.com/shows/sort/a_z/'
    agent.page.search('//div[@class="alphabet"]//li[not(contains(@class, "selected"))]/a').each do |letter_link|
      agent.get letter_link[:href]
      letter = letter_link.text.upcase
      agent.page.search('//li[@class="show"]/a').map do |show_link|
        Show.create(title: show_link.text, url:'http://tv.com' + show_link[:href].to_s + 'episodes/')
      end
      while next_page_link = agent.page.at('//div[@class="_pagination"]//a[@class="next"]') do
        agent.get next_page_link[:href]
        agent.page.search('//li[@class="show"]/a').map do |show_link|
          Show.create(title: show_link.text, url:'http://tv.com' + show_link[:href].to_s + 'episodes/')
        end
      end
    end
  end
end
Answer 0 (score: 2)
Rake is not a magic bullet: it won't run your code any faster.
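For completeness, wrapping the existing method in a rake task only takes a few lines. This is a minimal sketch, assuming a file such as lib/tasks/show.rake and a task name of update_screenings (both are my own hypothetical choices, not something from your app). It calls exactly the same method, so it will take exactly as long as calling it from the console.

```ruby
# lib/tasks/show.rake (hypothetical path and task name)
require 'rake' # not needed inside a Rails app; it only lets this file load on its own

namespace :show do
  desc "update next_screening for every show"
  # :environment is the Rails task that loads the app, so the Show model is available
  task :update_screenings => :environment do
    Show.update_all_screenings
  end
end
```

You would then run it with rake show:update_screenings.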
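What you can do is run your code more efficiently. The main time consumer in your code is the call to open(url), made once per show, one after another. If you could read all the URLs concurrently, the whole process should take a fraction of the time it takes now.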
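To illustrate why concurrency helps here, the following sketch compares sequential and concurrent fetching using plain Ruby threads. It makes no real HTTP requests: fake_fetch is a hypothetical stand-in for open(url) that just sleeps to simulate network latency, and the example.com URLs are placeholders.

```ruby
require 'benchmark'

# Hypothetical stand-in for open(url): sleeps to simulate waiting on the network.
def fake_fetch(url, latency: 0.05)
  sleep latency
  "<html>#{url}</html>"
end

# One request after another, as in the original update_all_screenings.
def fetch_sequentially(urls)
  urls.map { |url| fake_fetch(url) }
end

# One thread per URL; Thread#value joins the thread and returns the block's result.
def fetch_concurrently(urls)
  urls.map { |url| Thread.new { fake_fetch(url) } }.map(&:value)
end

urls = Array.new(10) { |i| "http://example.com/show/#{i}" }

sequential = Benchmark.realtime { fetch_sequentially(urls) }
concurrent = Benchmark.realtime { fetch_concurrently(urls) }

puts format('sequential: %.2fs, concurrent: %.2fs', sequential, concurrent)
```

While one thread is waiting, the others can proceed, so the total time is closer to the slowest single request than to the sum of all of them. One thread per URL is fine for a demonstration, but with thousands of shows you would want to cap the number of simultaneous requests.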
You can use the typhoeus gem (or some other gem) to handle this for you.
- Danger! Untested code! -
I have no experience with this gem, but your code could look something like this:
require 'nokogiri'
require 'open-uri'
require 'typhoeus'

class Show < ActiveRecord::Base
  has_many :user_shows
  has_many :users, through: :user_shows

  def self.update_all_screenings
    hydra = Typhoeus::Hydra.hydra
    Show.all.each do |show|
      request = Typhoeus::Request.new(show.url, followlocation: true)
      request.on_complete do |response|
        show.update_attribute(:next_screening, Show.update_next_screening(response.body))
      end
      hydra.queue(request)
    end
    hydra.run
  end

  def self.update_next_screening(body)
    nextep = Nokogiri::HTML(body)
    ## Finds the title of the show and extracts the date of the show and converts to string ##
    begin
      title = nextep.at_css('h1').text
      date = nextep.at_css('.next_episode .highlight_date').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
      date = date.to_s
      ## Because if it airs today it won't have a date rather a time this checks whether or not
      ## there is a date. If there is it will remain, if not it will insert todays date
      ## plus get the time that the show is airing
      if date =~ /\d{1,2}\/\d{1,2}\/\d{4}/
        showtime = DateTime.strptime(date, "%m/%d/%Y")
      else
        date = DateTime.now.strftime("%D")
        time = nextep.at_css('.next_episode .highlight_date').text[/\dPM|\dAM/]
        time = time.to_s
        showtime = date + " " + time
        showtime = DateTime.strptime(showtime, "%m/%d/%y %l%p")
      end
      return showtime
    rescue
      return nil
    end
  end
end
The above should collect all the requests in one queue and run them concurrently, acting on each response as it arrives.
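A side note on the parsing, separate from the speed problem: the pattern /\dPM|\dAM/ only captures a single digit, so a text like "10PM" would yield "0PM". Below is a sketch of the date logic pulled out into a plain function that works on the text of the .highlight_date node. The name parse_showtime and the today: parameter are hypothetical, and I am assuming the page text contains either an m/d/yyyy date or an hour followed by AM/PM, as your comments describe.

```ruby
require 'date'

# Hypothetical helper: turns the text of the ".highlight_date" node into a DateTime.
# `today` can be passed in so the function does not depend on the current clock.
# Returns nil when the text contains neither a date nor a time.
def parse_showtime(text, today: Date.today)
  if (date = text[%r{\d{1,2}/\d{1,2}/\d{4}}])
    # A full date is present, e.g. "10/12/2013"
    DateTime.strptime(date, '%m/%d/%Y')
  elsif (match = text.match(/(\d{1,2})\s*(AM|PM)/i))
    # Only a time is present, so the show airs today; convert to a 24-hour value
    hour = match[1].to_i % 12
    hour += 12 if match[2].upcase == 'PM'
    DateTime.new(today.year, today.month, today.day, hour)
  end
end
```

Keeping this free of Nokogiri and the network makes it easy to check in isolation, and update_next_screening could then pass it nextep.at_css('.next_episode .highlight_date').text.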