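I want a caching version of Mechanize. The idea is that #get(uri ...) checks whether the uri has been fetched before and, if so, returns the response from the cache instead of hitting the web. If the uri is not in the cache, it hits the web and saves the response in the cache.
My naive approach doesn't work. (I probably don't need to mention that CachedWebPage is a subclass of ActiveRecord::Base):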
class CachingMechanize < Mechanize
  def get(uri, parameters = [], referer = nil, headers = {})
    page = if (record = CachedWebPage.find_by_uri(uri.to_s))
             record.contents
           else
             super.tap { |contents| CachedWebPage.create!(:uri => uri, :contents => contents) }
           end
    yield page if block_given?
    page
  end
end
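This fails because the object returned by Mechanize#get() is a complex, circular structure that neither YAML nor JSON is willing to serialize for storage in the database.
I realize that what I want is to capture the low-level contents before Mechanize parses them.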
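One way to picture "capturing the low-level contents" is to store only primitive fields, which JSON handles without trouble. The following is a minimal sketch using only the Ruby standard library; RawResponse, dump_raw and load_raw are hypothetical names, not part of Mechanize, and rebuilding a Mechanize::Page from these fields is not shown.

```ruby
require 'json'

# Hypothetical plain-data record holding what came over the wire,
# before any parsing takes place.
RawResponse = Struct.new(:uri, :code, :headers, :body)

# Serialize the record to a JSON string suitable for a text column.
def dump_raw(raw)
  JSON.generate(raw.to_h)
end

# Rebuild the record from the stored JSON string.
def load_raw(json)
  fields = JSON.parse(json)
  RawResponse.new(*fields.values_at('uri', 'code', 'headers', 'body'))
end

raw = RawResponse.new('http://example.com/', '200',
                      { 'content-type' => 'text/html' },
                      '<html><body>hello</body></html>')
stored   = dump_raw(raw)
restored = load_raw(stored)
puts restored.body
```

Because the record contains only strings and a hash of strings, it has no cycles and no procs, so it should round-trip cleanly.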
Answer 0 (score: 2)
It turns out the solution is simple, although not very clean. Caching the result of Mechanize#get() is straightforward:
class CachingMechanize < Mechanize
  def get(uri, parameters = [], referer = nil, headers = {})
    WebCache.with_web_cache(uri.to_s) { super }
  end
end
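...where with_web_cache() uses YAML to serialize and cache the object returned by super.
My problem was that, by default, Mechanize#get() returns a Mechanize::Page object that contains some lambda objects, which YAML cannot dump and load. The fix was to eliminate those lambdas, which turned out to be fairly simple. The full code follows.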
class CachingMechanize < Mechanize
  def initialize(*args)
    super
    sanitize_scheme_handlers
  end

  def get(uri, parameters = [], referer = nil, headers = {})
    WebCache.with_web_cache(uri.to_s) { super }
  end

  # private

  # Replace the default lambda handlers with a plain object so that
  # the resulting pages can be dumped and loaded by YAML.
  def sanitize_scheme_handlers
    scheme_handlers['http'] = SchemeHandler.new
    scheme_handlers['https'] = scheme_handlers['http']
    scheme_handlers['relative'] = scheme_handlers['http']
    scheme_handlers['file'] = scheme_handlers['http']
  end

  class SchemeHandler
    def call(link, page) ; link ; end
  end
end
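This applies beyond this particular example: if you see a YAML error like:
TypeError: allocator undefined for Proc
check whether the object you are trying to serialize and deserialize contains a lambda or proc. If you can (as I could in this case) replace the lambda with an object that responds to a method call, you should be able to work around the problem.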
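The substitution can be illustrated in isolation. This is a minimal sketch using only the standard library; PassThroughHandler and Holder are hypothetical stand-ins for the scheme handler and for the object graph that holds it, and YAML.safe_load with permitted_classes is used here on the assumption of a reasonably recent Psych.

```ruby
require 'yaml'

# Hypothetical replacement for a lambda such as ->(link, page) { link }.
# A plain object with a #call method has no Proc inside it.
class PassThroughHandler
  def call(link, _page)
    link
  end
end

# Hypothetical container standing in for an object that keeps a handler.
class Holder
  attr_reader :handler

  def initialize(handler)
    @handler = handler
  end
end

holder   = Holder.new(PassThroughHandler.new)
yaml     = YAML.dump(holder)
# Custom classes must be listed explicitly when loading safely.
restored = YAML.safe_load(yaml, permitted_classes: [Holder, PassThroughHandler])

# The restored handler is still callable, just like the lambda it replaced.
puts restored.handler.call('http://example.com/', nil)
```

Callers that only ever invoke handler.call(...) cannot tell the difference between the lambda and the object, which is why the swap is safe in this kind of situation.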
Hope this helps someone else.
In response to @Martin's request for the definition of WebCache, here it is:
# Simple model for caching pages fetched from the web. Assumes
# a schema like this:
#
#   create_table "web_caches", :force => true do |t|
#     t.text     "key"
#     t.text     "value"
#     t.datetime "expires_at"
#     t.datetime "created_at", :null => false
#     t.datetime "updated_at", :null => false
#   end
#   add_index "web_caches", ["key"], :name => "index_web_caches_on_key", :unique => true
#
class WebCache < ActiveRecord::Base
  serialize :value

  # WebCache.with_web_cache(key) {
  #   ...body...
  # }
  #
  # Searches the web_caches table for an entry with a matching key. If
  # found, and if the entry has not expired, the value for that entry is
  # returned. If not found, or if the entry has expired, yield to the
  # body and cache the yielded value before returning it.
  #
  # Options:
  #   :expires_at sets the expiration date for this entry upon creation.
  #     Defaults to one year from now.
  #   :expired_prior_to overrides the value of 'now' when checking for
  #     expired entries. Mostly useful for unit testing.
  #
  def self.with_web_cache(key, opts = {})
    serialized_key = YAML.dump(key)
    expires_at = opts[:expires_at] || 1.year.from_now
    expired_prior_to = opts[:expired_prior_to] || Time.zone.now
    if (r = self.where(:key => serialized_key).where("expires_at > ?", expired_prior_to)).exists?
      # cache hit
      r.first.value
    else
      # cache miss
      yield.tap { |value| self.create!(:key => serialized_key, :value => value, :expires_at => expires_at) }
    end
  end

  # Prune expired entries. Typically called by a cron job.
  def self.delete_expired_entries(expired_prior_to = Time.zone.now)
    self.where("expires_at < ?", expired_prior_to).destroy_all
  end
end
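To try the hit, miss and expiry behaviour without a database, the same logic can be approximated in memory. This is a rough sketch, not the model above: MemoryCache and its fetch method are hypothetical names, and a Hash stands in for the web_caches table.

```ruby
# Hypothetical in-memory stand-in for the web_caches table.
class MemoryCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @entries = {}
  end

  # Return the cached value for key if it has not expired; otherwise
  # run the block, store its result, and return it. The now: argument
  # plays the role of :expired_prior_to and is handy in tests.
  def fetch(key, expires_at: Time.now + 3600, now: Time.now)
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, expires_at)
    value
  end
end

cache = MemoryCache.new
calls = 0

first  = cache.fetch('http://example.com/') { calls += 1; 'body' }
second = cache.fetch('http://example.com/') { calls += 1; 'body' }
puts calls # the block ran only once, so this prints 1

# Pretend two hours have passed: the one-hour entry has now expired.
third = cache.fetch('http://example.com/', now: Time.now + 7200) { calls += 1; 'fresh' }
puts third
```

Note that the sketch overwrites an expired entry in place. With the unique index on key, the ActiveRecord version would presumably need to delete or update an expired row before create! could succeed for the same key.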