I wrote code that scrapes and parses the information on this site => www.africanbookscollective.com/browse/african-literature/fiction
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'ap'
require 'debugger'
require 'csv'
#collect all the authors, books, ISBN, publisher info
#====================================================
url = 'http://www.africanbookscollective.com/browse/african-literature/fiction'
page = Nokogiri::HTML(open(url))
# create an array for every book content on each page that has element of form
# [<ISBN Number>, <Book Pages>, <Book Dimensions>, <First Published>, <Publisher>,<CoverType>]
# save array into a csv file with the columns of:
# <ISBN Number> <Book Pages> <Book Dimensions> <First Published> <Publisher> <CoverType>
# opens a csv file and shovels column titles into the first row
CSV.open("bookinfo.csv", "w+") do |csv|
  csv << ["ISBN Number", "Book Pages", "Book Dimensions", "First Published", "Publisher", "CoverType"]
end
# initializes the page_num variable
page_num = 0
# the while loop runs as long as the statement below evaluates to true
#while page_num < 390
new_page = Nokogiri::HTML(open("http://www.africanbookscollective.com/browse/african-studies?b_start:int=#{page_num+10}&-C="))
# search for the context-details of each book
books = page.css('p.context-details').map do |book|
  book.text.gsub(/\s{2,}/, "").chomp.split(" |")
end
# appends context-details onto the csv we already created
CSV.open("bookinfo.csv", "a+") do |csv|
  books.each do |book|
    csv << book
  end
end
page_num += 10
#end
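This code only gets the information on page 1; it fails to grab any of the remaining pages (1-38). I think this has something to do with how my while loop is structured, right?
Why doesn't the URL built with string interpolation in new_page take me to the next page?
Thanks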
Answer 0 (score: 1)
Forget the page numbers and iterate by following the "Next" links instead. It should look something like this:
# page 1
page = Nokogiri::HTML(open(start_url))
do_something_with page
# repeat until no more "next" links
while a = page.at('a[title="Next page"]')
  page = Nokogiri::HTML(open(a[:href]))
  do_something_with page
end
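
Whichever way the pages are fetched, the per-page work has to run on the document that was just fetched. In the question's code the `while`/`end` lines are commented out and `page.css` is called on `page` (the first page) rather than on `new_page`, so only page 1 is ever parsed. Below is a minimal sketch of what `do_something_with` could do, using only the standard library so it can be read on its own. The helper names `details_to_row`, `absolute_url`, and `append_rows` are hypothetical, and resolving the link with `URI.join` is an assumption in case the "Next page" `href` turns out to be relative.

```ruby
require 'csv'
require 'uri'

# Hypothetical helper: turn the text of one p.context-details node into a
# CSV row, using the same whitespace cleanup and " |" delimiter as the question.
def details_to_row(text)
  text.gsub(/\s{2,}/, "").chomp.split(" |").map(&:strip)
end

# Hypothetical helper: resolve a link's href against the URL of the page it
# was found on. An absolute href is returned unchanged.
def absolute_url(current_url, href)
  URI.join(current_url, href).to_s
end

# Hypothetical helper: append one page's rows to the CSV file.
def append_rows(csv_path, rows)
  CSV.open(csv_path, "a") do |csv|
    rows.each { |row| csv << row }
  end
end

# How these might slot into the loop above (needs nokogiri and open-uri,
# which are deliberately not loaded in this sketch):
#
#   rows = page.css('p.context-details').map { |p| details_to_row(p.text) }
#   append_rows("bookinfo.csv", rows)
#   next_url = absolute_url(current_url, a[:href])
```

Keeping the row parsing and URL handling in small methods means the loop itself only has to fetch, call them, and look for the next link. On Ruby 3.0 and later, open-uri no longer patches `open`, so the fetch would be written as `URI.open(next_url)`.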