Question

I am trying to scrape an XML site and extract its contents.

class PageScraper
  def get_page_details         
    if xml_data
      #get the info  from xml website
    else
      #get it from html website
    end
  end
  def get_xml_details
    if xml_data
      #get it from xml website
    end
  end
  def xml_data
    xml_url = 'www.abcd.xml'
    #Download and parse the xml data from abcd.xml site using nokogiri-gem
  end
end

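Other methods here also need to call xml_data, and every call fetches and downloads the data from the XML site again.

Is there a way to store the XML data in a variable the first time it is requested (something like @data = xml_data()) and return the downloaded data? Subsequent calls to xml_data should then use the cached @data.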

Answer 1

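Why not use OpenURI and Nokogiri? The normal process of retrieving and parsing XML will do what you want. The Nokogiri site is full of examples.

As for your class, you will probably want a method that retrieves the page and also stores it in an instance or class variable, depending on whether the class is responsible for multiple pages or just one.

As an example, here is some code for parsing HTML, which is almost identical to what you would do for XML. The only real difference is using Nokogiri::XML instead of Nokogiri::HTML: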
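The caching idea can be sketched with Ruby's `||=` memoization idiom. This is only a minimal sketch: `loader` is a hypothetical callable standing in for the real download-and-parse step, so the example runs without network access, and the string it returns stands in for a parsed document.

```ruby
# Minimal memoization sketch. `loader` is a hypothetical stand-in for the real
# download-and-parse step, for example:
#   -> { Nokogiri::XML(URI.open(xml_url)) }
class PageScraper
  def initialize(loader)
    @loader = loader
  end

  # Runs the loader on the first call only; later calls return the cached @data.
  def xml_data
    @data ||= @loader.call
  end

  def get_xml_details
    xml_data ? 'details from XML' : nil
  end
end

calls = 0
scraper = PageScraper.new(-> { calls += 1; '<root/>' })
scraper.xml_data
scraper.get_xml_details
puts calls # prints 1: the loader ran once
```

Note that `||=` runs the loader again if it returned nil or false, so a failed download is retried on the next call rather than cached.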

require 'open-uri'
require 'nokogiri'

class PageScraper

  def initialize(url)
    # URI.open is provided by open-uri; plain Kernel#open no longer opens URLs on Ruby 3.0+
    @source = URI.open(url).read
    @dom = Nokogiri::HTML(@source)
  end

  def errors?
    !@dom.errors.empty?
  end

  def title
    @dom.title
  end

  def head
    @dom.at('head')
  end

  def body
    @dom.at('body')
  end

end

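Of course, you can change the accessors for elements such as head and body to match your particular use case.

Once this has run, both the HTML (or XML) source and the parsed DOM are available as instance variables, so you can easily refer to them. @source is not strictly necessary, because it can be regenerated with @dom.to_xml or @dom.to_html, unless the source contains errors; in that case Nokogiri tries to repair the document, so the result may differ from the original.

It is used like this: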

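page_scraper = PageScraper.new('http://www.example.com')
abort "HTML errors found" if page_scraper.errors?

# Nokogiri's Document#title returns the title text as a String
page_title_text = page_scraper.title
# Change the title through the <title> node
page_scraper.head.at('title').content = 'Foo bar'
page_css = page_scraper.head.at('style').text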
