Nokogiri in Rails - scraping results from multiple pages

Date: 2014-06-03 00:11:35

Tags: ruby-on-rails web-scraping nokogiri

I'm using Nokogiri in Rails to try to fetch results from multiple pages of a subreddit, but I only get the first page. Any ideas on how to accomplish this? I can't seem to figure it out. Here is my current RedditScraper class:

require 'nokogiri'
require 'open-uri'

class RedditScraper

  def initialize
    @headline = []
  end

  def fetch_reddit_headlines
    page = Nokogiri::HTML(open("http://www.reddit.com/r/ruby"))
    page.css('a.title').each do |link|
      if link['href'].include?('http')
        @headline << { content: link.content, href: link['href'] }
      else
        @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
      end
    end
    @headline
  end
end

Just added:

Controller method:

def index
  @fetch_reddit = RedditScraper.new.fetch_reddit_headlines
end

View code:

<ol>
  <% @fetch_reddit.each do |url| %>
    <li><%= link_to url[:content], url[:href], target: '_' %></li>
  <% end %>
</ol>

Screenshot: (image not included in the text)

1 Answer:

Answer 0 (score: 1):

If you use Mechanize together with Nokogiri, you can follow the next-page link like this:

Update: fixed a few errors

require 'nokogiri'
require 'open-uri'
require 'mechanize'

class RedditScraper

  def initialize
    @headline = []
    @agent = Mechanize.new
  end

  def fetch_reddit_headlines
    mech_page = @agent.get('http://www.reddit.com/r/ruby')

    num_pages_to_scrape = 10
    count = 0

    while num_pages_to_scrape > count
      # Mechanize exposes the underlying Nokogiri document via #parser,
      # so the existing CSS extraction code works unchanged.
      page = mech_page.parser

      page.css('a.title').each do |link|
        if link['href'].include?('http')
          @headline << { content: link.content, href: link['href'] }
        else
          @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
        end
      end

      count += 1
      break if count >= num_pages_to_scrape

      # Follow the "next" link at the bottom of the listing; stop if it is missing.
      next_link = page.css('.nextprev a').last
      break unless next_link
      mech_page = @agent.get(next_link['href'])
    end

    @headline
  end
end


# Quick check outside Rails:
r = RedditScraper.new
r.fetch_reddit_headlines
puts r.instance_variable_get(:@headline)
puts r.instance_variable_get(:@headline).count
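
For reference, the same pagination can also be sketched without Mechanize, by following the ".nextprev" link with plain open-uri and Nokogiri. This is only a minimal sketch that assumes the selectors used above (a.title and .nextprev a) still match Reddit's old markup; the PlainRedditScraper class name and the pages argument are illustrative, not part of the original code:

require 'nokogiri'
require 'open-uri'

# Minimal sketch, no Mechanize: follow the ".nextprev" link manually.
# Assumes the same selectors as the code above; class name and the
# `pages` argument are illustrative. On Ruby 3+ use URI.open instead of open.
class PlainRedditScraper
  def fetch_reddit_headlines(pages = 10)
    headlines = []
    url = 'http://www.reddit.com/r/ruby'

    pages.times do
      page = Nokogiri::HTML(open(url))

      page.css('a.title').each do |link|
        href = link['href']
        href = "http://reddit.com" + href unless href.include?('http')
        headlines << { content: link.content, href: href }
      end

      # Stop when there is no "next" link on the page.
      next_link = page.css('.nextprev a').last
      break unless next_link
      url = next_link['href']
    end

    headlines
  end
end

The Mechanize version above is still the more convenient choice for real scraping, since it keeps cookies and request state consistent across page fetches.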