Question

I'm trying to grab the full HTML of a web page when its URL is saved. Here is my model, where I'm attempting to write the method.

class Page < ActiveRecord::Base
  def processPages(page_url)
    open(page_url) do |uri|
      html = uri.read
      create!( html => page.html )
    end
  end
end

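I'm trying to put the raw HTML held in html into an attribute of my Page object, but I can't work out how to save the content.

I'm also struggling to call processPages from my controller's create action, which is currently just the basic scaffold.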

Answer 1

There are many ways to do this. I would use an after_save model callback, so that fetching the HTML happens behind the scenes and the controller stays clean.

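require 'open-uri'

class Page < ActiveRecord::Base
  after_save :process_pages

  # Fetch the raw HTML for this page's url and store it in the html attribute.
  def process_pages
    # URI.open comes from open-uri (Ruby 2.5+); older Rubies used a bare open(url).
    # update_column skips validations and callbacks, so after_save is not
    # triggered again (calling save here would loop forever).
    update_column(:html, URI.open(url).read)
  end
end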

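Since url and html are both attributes of Page, there is no need to pass anything to the method. You can find more about HTML scraping in this SO question.

Ah, and processPages really doesn't look like Ruby! So I changed it to process_pages.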

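Update

If you need to parse the page content you can use Nokogiri, and if you need to submit forms or the like you can use Mechanize. For simple HTML scraping like this, plain open-uri, as in the code above, will do the job.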
