Iterating over multiple URLs to parse HTML with Nokogiri

Date: 2013-03-16 18:38:27

Tags: ruby nokogiri open-uri

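What I'm trying to do is use Nokogiri to scrape the names and prices of items from multiple vendors. I'm passing the CSS selectors (which locate the name and price) to Nokogiri through method arguments.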

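Any guidance on how to pass multiple URLs to the "scrape" method while also passing the other arguments (e.g. vendor, item_path)? Or am I going about this in completely the wrong way?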

Here is the code:

require 'rubygems' # Load Ruby Gems
require 'nokogiri' # Load Nokogiri
require 'open-uri' # Load Open-URI

@@collection = Array.new # Array to hold meta hash

def scrape(url, vendor, item_path, name_path, price_path)
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
        @@collection << meta = Hash.new # Creates a new hash then add to global array
        meta[:vendor] = vendor
        meta[:name] = item.css(name_path).text.strip
        meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join 
    end
end

scrape( "page_a.html", "Sample Vendor A", "#products", ".title", ".prices")
scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

2 Answers:

Answer 0 (score: 1)

You can pass multiple URLs as an array, the way you do in your second example:

scrape( ["page_a.html", "page_b.html"], "Sample Vendor B",  "#items", ".productname", ".price")

Your scrape method then has to iterate over those URLs, for example:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Open and parse the page
    items = doc.css(item_path) # Select the item nodes
    items.each do |item| # Iterate over each item in the grid
      meta = {} # One hash per item
      meta[:vendor] = vendor
      meta[:name] = item.css(name_path).text.strip
      meta[:price] = item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      @@collection << meta # Append to the shared array
    end
  end
end

This also means the first example has to pass its URL as an array too:

scrape( ["page_a.html"], "Sample Vendor A", "#products", ".title", ".prices")
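
If you would rather keep calling the method with a single URL string, one option is to normalize the argument with Kernel#Array, which wraps a lone value in an array and leaves an existing array unchanged. A minimal sketch of that idea follows; each_url is a hypothetical helper name used only for illustration, and the fetching and parsing step is left out:

```ruby
# Hypothetical helper (not part of Nokogiri or open-uri): yields each URL,
# whether the caller passed a single string or an array of strings.
def each_url(urls)
  Array(urls).each { |url| yield url }
end

visited = []
each_url("page_a.html") { |url| visited << url }
each_url(["page_a.html", "page_b.html"]) { |url| visited << url }
# visited now holds "page_a.html", "page_a.html", "page_b.html"
```

Inside scrape, this would amount to iterating over Array(urls) instead of urls.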

Answer 1 (score: 1)

FYI, using @@collection is not appropriate here. Instead, write your method so that it returns the values:

def scrape(urls, vendor, item_path, name_path, price_path)
  collection = []
  urls.each do |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.each do |item| # Iterates through each item on grid
      collection << {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    end 
  end   

  collection
end
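
Because the method returns its results, the caller decides where they go. Here is a sketch of driving it from a list of vendors. The vendors table is a hypothetical structure (its values are copied from the calls in the question), and scrape_stub is only a stand-in for the real scrape that skips fetching and parsing so the example runs on its own:

```ruby
# Stand-in for the real scrape method so this sketch runs without Nokogiri
# or any HTML files; it returns one hash per URL instead of one per item.
def scrape_stub(urls, vendor, item_path, name_path, price_path)
  urls.map { |url| { :vendor => vendor, :source => url } }
end

# Hypothetical vendor table; the values come from the calls in the question.
vendors = [
  { :urls => ["page_a.html"], :vendor => "Sample Vendor A",
    :item_path => "#products", :name_path => ".title", :price_path => ".prices" },
  { :urls => ["page_a.html", "page_b.html"], :vendor => "Sample Vendor B",
    :item_path => "#items", :name_path => ".productname", :price_path => ".price" }
]

# Concatenate every vendor's results into one flat array.
collection = vendors.flat_map do |v|
  scrape_stub(v[:urls], v[:vendor], v[:item_path], v[:name_path], v[:price_path])
end
```

With the real method, the same loop would call scrape in place of scrape_stub.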

The method itself can be reduced to:

def scrape(urls, vendor, item_path, name_path, price_path)
  urls.flat_map { |url|
    doc = Nokogiri::HTML(open(url)) # Opens URL
    items = doc.css(item_path) # Sets items
    items.map { |item| # Iterates through each item on grid
      {
        :vendor => vendor,
        :name   => item.css(name_path).text.strip,
        :price  => item.css(price_path).to_s.scan(/\d+[.]\d+/).join
      }
    } 
  }
end
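
One thing to watch for with nested map calls: an outer map returns one array per URL, whereas the loop version collects every item into a single array. Using flat_map on the outer level (or calling flatten(1) on the result) keeps the return value flat. A small illustration with stand-in data in place of fetched pages; the pages hash is hypothetical:

```ruby
# Hypothetical stand-in data: each "page" is just a list of item names.
pages = {
  "page_a.html" => ["apple", "pear"],
  "page_b.html" => ["plum"]
}

# Outer map: one inner array per URL.
nested = pages.keys.map { |url| pages[url].map { |name| { :name => name } } }

# Outer flat_map: a single list of hashes across all URLs.
flat = pages.keys.flat_map { |url| pages[url].map { |name| { :name => name } } }
```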