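I'm using Nokogiri in Rails to try to fetch results from multiple pages of a subreddit, but I only get the first page. Any ideas on how to do this? I can't seem to figure it out. Here is my RedditScraper class so far: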
require 'nokogiri'
require 'open-uri'

class RedditScraper
  def initialize
    @headline = []
  end

  def fetch_reddit_headlines
    # URI.open comes from open-uri; plain Kernel#open no longer opens URLs on Ruby 3.0+.
    page = Nokogiri::HTML(URI.open("http://www.reddit.com/r/ruby"))
    page.css('a.title').each do |link|
      if link['href'].include?('http')
        # Already an absolute URL (external link).
        @headline << { content: link.content, href: link['href'] }
      else
        # Relative URL (self post): prefix the site root.
        @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
      end
    end
    @headline
  end
end
Just added:
Controller method:
def index
  @fetch_reddit = RedditScraper.new.fetch_reddit_headlines
end
View code:
<ol>
  <% @fetch_reddit.each do |url| %>
    <li><%= link_to url[:content], url[:href], target: '_blank' %></li>
  <% end %>
</ol>
Answer 0 (score: 1)
Update: fixed some errors.
If you use Mechanize together with Nokogiri, you can follow the next-page link like this:
require 'nokogiri'
require 'mechanize'

class RedditScraper
  def initialize
    @headline = []
    @agent = Mechanize.new
  end

  def fetch_reddit_headlines
    mech_page = @agent.get('http://www.reddit.com/r/ruby')
    num_pages_to_scrape = 10
    count = 0

    while num_pages_to_scrape > count
      page = mech_page.parser # the Nokogiri document behind the Mechanize page
      page.css('a.title').each do |link|
        if link['href'].include?('http')
          @headline << { content: link.content, href: link['href'] }
        else
          @headline << { content: link.content, href: "http://reddit.com" + link['href'] }
        end
      end
      count += 1

      # Assumes the last link in the next/prev block is the "next" link.
      # Stop early if the block is missing instead of calling a method on nil.
      next_link = page.css('.nextprev').css('a').last
      break if next_link.nil?
      mech_page = @agent.get(next_link['href'])
    end

    @headline
  end
end

r = RedditScraper.new
headlines = r.fetch_reddit_headlines
puts headlines
puts headlines.count
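The paging logic itself can be tried without touching the network. The following is only a minimal sketch of the same idea (collect a page, follow its "next" link, stop at a page limit or when no link is left), not the answer's actual implementation. The names `absolute_url`, `collect_headlines`, `FAKE_PAGES`, and the `fetch` callable are hypothetical, and the `after=t3_abc` query string is made-up sample data; only `URI.join` is a real standard-library call.

```ruby
require 'uri'

# Hypothetical helper: resolve a possibly relative href against the URL
# of the page it was found on.
def absolute_url(base, href)
  URI.join(base, href).to_s
end

# Hypothetical helper: walk "next" links starting at start_url.
# fetch is any callable that takes a URL and returns a hash of the form
# { headlines: [...], next_href: "..." or nil }.
def collect_headlines(start_url, max_pages:, fetch:)
  headlines = []
  url = start_url
  max_pages.times do
    page = fetch.call(url)
    headlines.concat(page[:headlines])
    next_href = page[:next_href]
    break if next_href.nil? # no "next" link: this was the last page
    url = absolute_url(url, next_href)
  end
  headlines
end

# Stand-in for the network (made-up data): two fake pages keyed by URL.
FAKE_PAGES = {
  'http://www.reddit.com/r/ruby' =>
    { headlines: ['first', 'second'], next_href: '/r/ruby/?count=25&after=t3_abc' },
  'http://www.reddit.com/r/ruby/?count=25&after=t3_abc' =>
    { headlines: ['third'], next_href: nil }
}.freeze

fake_fetch = ->(url) { FAKE_PAGES.fetch(url) }
all = collect_headlines('http://www.reddit.com/r/ruby', max_pages: 10, fetch: fake_fetch)
puts all.inspect # prints ["first", "second", "third"]
```

In a real scraper, `fetch` would presumably wrap `@agent.get` and the `a.title` and `.nextprev` selectors used above. Keeping it as a parameter just makes the loop and its stopping conditions easy to check in isolation.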