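First of all, thank you all for the valuable advice you give programmers like me when we are solving day-to-day problems.
This is my first question on Stack Overflow; I have been stuck on this problem for almost a week.
We are building a scraper that crawls specific websites and extracts content from them, and we use Mechanize to do it. Because the crawl takes a long time, we decided to use the Redis-backed Resque gem to run the crawling process as a background job. When I send the process to the background, however, I get the error in the title.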
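For context, Resque stores each job as JSON containing the job's class name and its argument list, and a worker later looks the class up by name and calls its class-level `perform`. The following is a minimal sketch of that round trip that runs without Redis or the resque gem. `CrawlJob`, `fake_enqueue`, `fake_work`, and the URL are hypothetical names used only for illustration; they are not part of Resque or of the code in this question.

```ruby
require 'json'

# Hypothetical job class; the name CrawlJob and the URL below are illustrative only.
class CrawlJob
  # Resque reads the queue name from this class-level instance variable.
  @queue = :index_queue

  # Resque workers call this class method with the deserialized arguments.
  def self.perform(url)
    "crawling #{url}"
  end
end

# Stand-in for Resque.enqueue (assumption: simplified, no Redis involved).
# The job is stored as JSON holding the class name and the argument list,
# so the arguments have to survive a JSON round trip.
def fake_enqueue(klass, *args)
  JSON.generate('class' => klass.to_s, 'args' => args)
end

# Stand-in for a worker: look the class up by name and call its perform.
def fake_work(payload)
  job = JSON.parse(payload)
  Object.const_get(job['class']).perform(*job['args'])
end

payload = fake_enqueue(CrawlJob, 'http://example.com/search')
result = fake_work(payload)
puts result
```

Under this model, plain values such as a URL string pass through unchanged, whereas a symbol such as `:page` comes back as the string `"page"`, and a live Mechanize page object would not come back as a page object at all.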
My code in lib/parsers/home.rb:

    require 'resque'
    require File.dirname(__FILE__)+"/../index"

    class Home < Index
      Resque.enqueue(Index , :page )

      def self.perform(page)
        super (page)
        search_form = page.form_with :name=>"frmAgent"
        resuts_page = search_form.submit
        total_entries = resuts_page.parser.xpath('//*[@id="PagingTable"]/tr[2]/td[2]').text
        if total_entries =~ /(\d+)\s*$/
          total_entries = $1
        else
          total_entries = "unknown"
        end
        start_res_idx = 1
        while true
          puts "Found #{total_entries} entries"
          detail_links = resuts_page.parser.xpath('//*[@id="MainTable"]/tr/td/a')
          detail_links.each do |d_link|
            if d_link.attribute("class")
              next
            else
              data_page = @agent.get d_link.attribute("href")
              fields = get_fields_from_page data_page
              save_result_page page.uri.to_s, fields
              #break
            end
          end
        end
        site_done
      rescue Exception => e
        puts "error: #{e}"
      end
    end

And the superclass in lib/index.rb is:

    require 'resque'
    require 'mechanize'
    require 'mechanize/form'

    class Index
      @queue = :Index_queue

      def initialize(site)
        @site = site
        @agent = Mechanize.new
        @agent.user_agent = Mechanize::AGENT_ALIASES['Windows Mozilla']
        @agent.follow_meta_refresh = true
        @rows_parsed = 0
        @rows_total = 0
      rescue Exception => e
        log "Unable to login: #{e.message}"
      end

      def run
        log "Parsing..."
        url = "unknown"
        if @site.url
          url = @site.url
          log "Opening #{url} as a data page"
          @page = @agent.get(url)
          #perform method should be override in subclasses
          @data = self.perform(@page)
        else
          #some sites do not have "datapage" URL
          #for example after login you're already on your very own datapage
          #this is to be addressed in 'perform' method of subclass
          @data = self.perform(nil)
        end
      rescue Exception=>e
        puts "Failed to parse URL '#{url}', exception=>"+e.message
        set_site_status("error "+e.message)
      end

      #overriding method
      def self.perform(page)
      end

      def save_result_page(url, result_params)
        result = Result.find_by_sql(["select * from results where site_id = ? AND ref_code = ?", @site.id, utf8(result_params[:ref_code])]).first
        if result.nil?
          result_params[:site_id] = @site.id
          result_params[:time_crawled] = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
          result_params[:link] = url
          result = Result.create result_params
        else
          result.result_fields.each do |f|
            f.delete
          end
          result.link = url
          result.time_crawled = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
          result.html = result_params[:html]
          fields = []
          result_params[:result_fields_attributes].each do |f|
            fields.push ResultField.new(f)
          end
          result.result_fields = fields
          result.save
        end
        @rows_parsed +=1
        msg = "Saved #{@rows_parsed}"
        msg +=" of #{@rows_total}" if @rows_total.to_i > 0
        log msg
        return result
      end
    end

What is wrong with this code?
Thanks