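How to iterate over a website's pages with Rails and Nokogiri

Date: 2017-03-20 09:44:22

Tags: ruby-on-rails ruby nokogiri

I'm trying to build an information site that shows visitors all of the deals a specific merchant has on a particular website. I've managed to scrape the titles from the first page, and to build the paginated URLs in a loop and collect them in an array.

My code is supposed to take each URL, pass it to the scraper, list that page's items, move on to the next page, scrape its titles and append them to the list built so far, and so on.

My controller looks like this: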

class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception

  class Entry
    def initialize(title)
      @title = title
    end
    attr_reader :title
  end

  def scrape_mydealz
    require 'open-uri'
    urlarray = Array.new
    # --------------------------------------------------------------- Build URLs
    pagination = '&page=1'
    count = [1, 2]
    count.each do |i|
      base_url = "https://www.mydealz.de/search?q=media+markt"
      pagination = "&page=#{i}"
      combination = base_url + pagination
      urlarray << combination
    end
    # --------------------------------------------------------------- / Build URLs

    urlarray.each do |test|
      doc = Nokogiri::HTML(open("#{test}"))
      entries = doc.css('article.thread')
      @entriesArray = []
      entries.each do |entry|
        title = entry.css('a.vwo-thread-title').text
        @entriesArray << Entry.new(title)
      end
    end
    render template: 'scrape_mydealz'
  end
end

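With this code, it iterates through to page 2 but displays only the scraped results from page 2.

The result can be seen here: https://mm-scraper-neevoo.c9users.io/

2 Answers:

Answer 0: (score: 0)

You are re-initializing @entriesArray on every iteration. The simplest fix is to move the initialization outside the loop: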

@entriesArray = []

urlarray.each do |test|
  doc = Nokogiri::HTML(open("#{test}"))
  entries = doc.css('article.thread')
  entries.each do |entry|
    title = entry.css('a.vwo-thread-title').text
    @entriesArray << Entry.new(title)
  end
end
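
To make the difference concrete without hitting the network, here is a minimal sketch that mimics the two loop shapes with plain arrays. FAKE_PAGES and both method names are hypothetical stand-ins invented for this illustration; they are not part of the original controller.

```ruby
# Hypothetical stand-in for the scraped pages: page number => titles found there.
FAKE_PAGES = {
  1 => ['Deal A', 'Deal B'],
  2 => ['Deal C']
}.freeze

# Shape of the question's loop: the accumulator is reset inside the loop,
# so only the titles from the last page survive.
def titles_reset_each_page(pages)
  titles = nil
  pages.each_value do |page_titles|
    titles = [] # re-initialized on every iteration
    page_titles.each { |title| titles << title }
  end
  titles
end

# Shape of the fix: initialize once, before the loop, then keep appending.
def titles_accumulated(pages)
  titles = []
  pages.each_value do |page_titles|
    page_titles.each { |title| titles << title }
  end
  titles
end
```

Calling titles_reset_each_page(FAKE_PAGES) returns only the page 2 title, while titles_accumulated(FAKE_PAGES) returns the titles from both pages in order.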

答案 1 :(得分:0)

This is untested, but it's the general idea I'd use to scan two pages of a site and accumulate the titles:

require 'open-uri'

BASE_URL = 'https://www.mydealz.de/search?q=media+markt&page=1'

def scrape_mydealz 

  urls = []
  2.times do |i|
    url = URI.parse(BASE_URL)
    base_query = URI::decode_www_form(url.query).to_h
    base_query['page'] = 1 + i
    url.query = URI.encode_www_form(base_query)
    urls << url
  end

  @entries_array = []
  urls.each do |url|
    doc = Nokogiri::HTML(URI.open(url)) # URI.open needs Ruby 2.5+; bare open no longer fetches URLs on Ruby 3.0+
    doc.css('article.thread').each do |entry|
      @entries_array << Entry.new(entry.at('a.vwo-thread-title').text)
    end
  end
  render template: 'scrape_mydealz'
end
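
The URL-building step above can also be pulled into a small helper that is easy to check on its own. This is only a sketch using Ruby's standard uri library; page_urls is a hypothetical name, and it assumes the site paginates through a page query parameter, as the URLs in the question suggest.

```ruby
require 'uri'

# Hypothetical helper: returns one URL string per requested page number.
# Any existing "page" parameter in base_url is overwritten rather than duplicated.
def page_urls(base_url, pages)
  pages.map do |page|
    url = URI.parse(base_url)
    query = URI.decode_www_form(url.query.to_s).to_h
    query['page'] = page.to_s
    url.query = URI.encode_www_form(query)
    url.to_s
  end
end

urls = page_urls('https://www.mydealz.de/search?q=media+markt', 1..2)
```

Going through decode_www_form and encode_www_form, rather than concatenating "&page=#{i}" onto a string, keeps the query valid even when the base URL already carries a page parameter.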

Be careful when using text with search, css, or xpath:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]

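Note that in the first result, text has concatenated the contents of the <p> tags. Trying to split them apart again afterwards usually doesn't work.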