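I'm trying to build an informational website that shows visitors all offers from a particular retailer listed on a particular web page. I've managed to scrape the titles from the first page and to build the page URLs in a loop, collecting them in an array.
My code is supposed to take each URL, pass it to the scraper, list that page's items, move on to the next page, scrape its titles and append them to the list built so far, and so on.
My controller looks like this: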
class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception

  class Entry
    def initialize(title)
      @title = title
    end
    attr_reader :title
  end

  def scrape_mydealz
    require 'open-uri'
    urlarray = Array.new
    # --------------------------------------------------------------- build URLs
    pagination = '&page=1'
    count = [1, 2]
    count.each do |i|
      base_url = "https://www.mydealz.de/search?q=media+markt"
      pagination = "&page=#{i}"
      combination = base_url + pagination
      urlarray << combination
    end
    # --------------------------------------------------------------- / build URLs
    urlarray.each do |test|
      doc = Nokogiri::HTML(open("#{test}"))
      entries = doc.css('article.thread')
      @entriesArray = []
      entries.each do |entry|
        title = entry.css('a.vwo-thread-title').text
        @entriesArray << Entry.new(title)
      end
    end
    render template: 'scrape_mydealz'
  end
end
With this code, it iterates through to page 2 but displays only the scraped results from page 2.
The result can be seen here: https://mm-scraper-neevoo.c9users.io/
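Answer 0 (score: 0)
You are reinitializing @entriesArray on every iteration, so each page's results overwrite the previous page's. The simplest solution is to move the initialization outside the loop: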
@entriesArray = []
urlarray.each do |test|
  doc = Nokogiri::HTML(open("#{test}"))
  entries = doc.css('article.thread')
  entries.each do |entry|
    title = entry.css('a.vwo-thread-title').text
    @entriesArray << Entry.new(title)
  end
end
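The difference between the two loop shapes can be reproduced without any network access. The following is only a minimal sketch: fake_pages and fetch_titles are hypothetical stand-ins for the open-uri and Nokogiri steps, and the deal titles are made-up placeholder data.

```ruby
# Hypothetical canned data standing in for the two result pages.
def fake_pages
  {
    'https://www.mydealz.de/search?q=media+markt&page=1' => ['Deal A', 'Deal B'],
    'https://www.mydealz.de/search?q=media+markt&page=2' => ['Deal C']
  }
end

# Hypothetical stand-in for "download the page and extract its titles".
def fetch_titles(url)
  fake_pages.fetch(url, [])
end

# Shape of the code in the question: the accumulator is reset inside the
# loop, so only the titles of the last page survive.
def titles_reset_each_page(urls)
  titles = []
  urls.each do |url|
    titles = [] # reinitialized on every iteration
    fetch_titles(url).each { |title| titles << title }
  end
  titles
end

# Shape of the fix: the accumulator is created once, before the loop.
def titles_accumulated(urls)
  titles = []
  urls.each do |url|
    fetch_titles(url).each { |title| titles << title }
  end
  titles
end

urls = fake_pages.keys
p titles_reset_each_page(urls) # only the last page's titles
p titles_accumulated(urls)     # titles from both pages
```

In the controller the same reasoning applies to @entriesArray: it has to be assigned once before urlarray.each, not inside it.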
Answer 1 (score: 0)
This is untested, but it's the general idea of what I'd use to scan two pages of the site and accumulate the titles:
require 'open-uri'

BASE_URL = 'https://www.mydealz.de/search?q=media+markt&page=1'

def scrape_mydealz
  urls = []
  2.times do |i|
    url = URI.parse(BASE_URL)
    base_query = URI::decode_www_form(url.query).to_h
    base_query['page'] = 1 + i
    url.query = URI.encode_www_form(base_query)
    urls << url
  end

  @entries_array = []
  urls.each do |url|
    doc = Nokogiri::HTML(open(url))
    doc.css('article.thread').each do |entry|
      @entries_array << Entry.new(entry.at('a.vwo-thread-title').text)
    end
  end

  render template: 'scrape_mydealz'
end
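The URL-building part of the method above depends only on Ruby's standard uri library, so it can be tried on its own. This is a minimal sketch, assuming a hypothetical helper named page_urls that is not part of the answer; it returns URL strings rather than URI objects.

```ruby
require 'uri'

# Hypothetical helper: returns one URL string per page number, replacing
# any "page" parameter already present in the base URL's query string.
def page_urls(base_url, pages)
  pages.map do |page|
    url = URI.parse(base_url)
    # Split the query string into a Hash of parameter name => value.
    query = URI.decode_www_form(url.query.to_s).to_h
    query['page'] = page
    # Re-encode the Hash into a query string and write it back.
    url.query = URI.encode_www_form(query)
    url.to_s
  end
end

puts page_urls('https://www.mydealz.de/search?q=media+markt&page=1', 1..2)
```

Going through the query parameters instead of concatenating strings means an existing page value is overwritten rather than duplicated, and the search term is re-encoded consistently.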
Be careful when using text with search, css or xpath:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
Note that the first result concatenated the contents of the <p> tags. It is usually not possible to separate them again afterward.