Iterating in Mechanize to scrape pages

Time: 2015-04-16 18:49:12

Tags: ruby web-crawler nokogiri mechanize

I want to use Mechanize to automate a process that scrapes some web pages and saves information from them.

The page is Lookbook's North America page (http://lookbook.nu/north-america).

I want to iterate over the ul with id="looks" and, within that iteration, click through to each user in a look. The element looks like this:

<a href="/luciamouet" data-page-track="user name click" data-track="user name click | byline" target="_blank" title="Lucia Mouet">Lucia M.</a>

I want to go to each user's page and store some information from it.

This is what I have so far, but I'm stuck on how to iterate and follow each user's link:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'

agent = Mechanize.new

page = agent.get('http://lookbook.nu/north-america')

looks = page.parser.css('#looks p')

looks.each do |x|
  puts x
end

2 Answers:

Answer 0 (score: 1)

You can build the detail page URLs yourself: grab each relative URL (I'll call it the path), append it to the base URL, and make a new request.

require 'mechanize'

agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Page

base = 'http://lookbook.nu'
page = agent.get(base + '/north-america')

detail_pages = page.search("//div[contains(@class, 'look_meta_container')]/p/a[1]/@href").map(&:text)
# ["/user/1069907-Veronica-P", "/elliott_alexzander", "/neno", "/skirtsofurban", "/tovogueorbust", "/dthutt", "/ryapie", "/lovebetweentheracks", "/lonleyboy", "/bobbyraffin", "/tsangtastic", "/user/737385-Katia-H"]

detail_pages.each do |path|
  page = agent.get(base + path)

  name = page.search("//div[@id='userheader']//h1/a").text
  fans = page.search("//span[contains(text(), 'Fans')]/../span[1]").text

  puts name + " have " + fans + " fans"
end

=>

Veronica  P have 26,044 fans
Elliott Alexzander have 3,409 fans
Neno Neno have 15,304 fans
Laura P have 975 fans
Alexandra G. have 620 fans
Dayeanne  Hutton have 336 fans
Mariah Alysz have 288 fans
Lina Dinh have 11,675 fans
Talal Amine have 882 fans
Bobby Raffin have 72,469 fans
Jenny Tsang have 8,909 fans
Katia H. have 282 fans

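Note: I set #pluggable_parser.default so that the response comes back as a Mechanize::Page. Normally you don't need this, but the site doesn't set the content type correctly.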
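Since the question asks about storing the information rather than just printing it, here is a minimal offline sketch of turning the scraped strings into structured records. It assumes the `name` and `fans` values look like the output above; the names `scraped` and `records` are hypothetical, and the sample rows are copied from that output instead of being fetched.

```ruby
require 'json'

# Sample rows copied from the output above; in the real loop these would
# be the `name` and `fans` strings scraped from each detail page.
scraped = [
  ['Veronica  P', '26,044'],
  ['Elliott Alexzander', '3,409'],
  ['Katia H.', '282']
]

# Collapse repeated spaces in names and strip the thousands separator
# from the fan counts so they can be stored as integers.
records = scraped.map do |name, fans|
  { name: name.squeeze(' ').strip, fans: fans.delete(',').to_i }
end

# Serialize to JSON; File.write could persist this string to a path of
# your choice (not done here, to keep the sketch side-effect free).
json = JSON.pretty_generate(records)
puts json
```

Inside the `detail_pages.each` loop you would push one such hash per user onto an array and serialize once at the end.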

Answer 1 (score: 1)

Rather than messing around with base + path as @radubogdan suggests, just use page.uri:

page.search('#looks h1 a').each do |a|
  url = page.uri.merge a[:href]
  page2 = agent.get url
  puts page2.title
end
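As a small offline sketch of why merging can be more robust than string concatenation: `page.uri` is a standard-library URI object, so `merge` resolves an href against the current page much as a browser would. Here `page_uri` is a stand-in for `page.uri`, the first two hrefs come from this thread, and the absolute URL is a hypothetical example.

```ruby
require 'uri'

# Stand-in for page.uri; no network request is made here.
page_uri = URI('http://lookbook.nu/north-america')

# Root-relative paths from this thread, plus a hypothetical absolute URL
# to show that merge leaves an already-absolute href intact.
hrefs = ['/luciamouet', '/user/737385-Katia-H', 'http://example.com/profile']

# Resolve each href against the page's own URL.
resolved = hrefs.map { |href| page_uri.merge(href).to_s }
resolved.each { |url| puts url }
```

Concatenating a base string with the absolute href would produce a broken URL, whereas merge handles both cases without special-casing.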