Question

I wrote code that fetches and parses information from this website => www.africanbookscollective.com/browse/african-literature/fiction

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'ap'
require 'debugger'
require 'csv'

#collect all the authors, books, ISBN, publisher info
#====================================================
url = 'http://www.africanbookscollective.com/browse/african-literature/fiction'
page = Nokogiri::HTML(open(url))

# create an array for every book content on each page that has element of form
# [<ISBN Number>, <Book Pages>, <Book Dimensions>, <First Published>, <Publisher>,<CoverType>]
# save array into a csv file with the columns of:
# <ISBN Number> <Book Pages> <Book Dimensions> <First Published> <Publisher> <CoverType>

# opens a csv file and shovels column titles into the first row
CSV.open("bookinfo.csv", "w+") do |csv|
  csv << ["ISBN Number", "Book Pages", "Book Dimensions", "First Published", "Publisher", "CoverType"]
end

# initializes the page_num variable
page_num = 0

# the while loop runs as long as the statement below evaluates to true
#while page_num < 390
new_page = Nokogiri::HTML(open("http://www.africanbookscollective.com/browse/african-studies?b_start:int=#{page_num+10}&-C="))
  # search for the context-details of each book
  books = page.css('p.context-details').map do |book|
    book.text.gsub(/\s{2,}/, "").chomp.split(" |")
  end


  #appends context-details onto the csv we already created
  CSV.open("bookinfo.csv", "a+") do |csv|
    books.each do |book|
      csv << book
    end
  end
  page_num += 10
#end

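This code only retrieves the information on page 1; it fails to pick up the remaining pages (1-38). I think this has something to do with how my while loop is structured, right?

Why doesn't the string interpolation in new_page take me to the next page?

Thanks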

Answer 1

Forget the numbers and iterate by following the "Next" link. It should look something like this:

# page 1
page = Nokogiri::HTML(open(start_url))
do_something_with page

# repeat until no more "next" links
while a = page.at('a[title="Next page"]')
  page = Nokogiri::HTML(open(a[:href]))
  do_something_with page
end
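
The loop above passes `a[:href]` straight to `open`, which only works if the link is an absolute URL. As a minimal sketch, assuming the "Next page" href may be relative, the href can be resolved against the current page's URL with `URI.join` from the standard library. The helper names `next_page_url` and `each_page_url` are hypothetical, not part of Nokogiri, and the guard against revisiting a URL is an added precaution rather than something the site is known to need.

```ruby
require 'uri'

# Resolve a "Next page" href against the URL of the page it came from.
# Returns nil when there is no href, which is the signal to stop.
def next_page_url(current_url, href)
  return nil if href.nil? || href.strip.empty?
  URI.join(current_url, href.strip).to_s
end

# Follow "next" links starting from start_url.
# The block receives each page URL and must return the raw href of that
# page's "next" link, or nil on the last page.
# Returns the list of URLs visited, in order.
def each_page_url(start_url)
  url = start_url
  seen = {}
  # Stop when there is no next link, or when a link points back to a
  # page that was already visited (avoids an endless loop).
  while url && !seen[url]
    seen[url] = true
    href = yield url
    url = next_page_url(url, href)
  end
  seen.keys
end
```

With Nokogiri, the block would fetch and process the page, then return `link && link[:href]` where `link = page.at('a[title="Next page"]')`. Note also that the code in the question builds `books` from `page` rather than `new_page`, and its `while`/`end` lines are commented out, so it only ever reads the first page.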

Navigating to the next page with Nokogiri using a while loop
