I want to get data from this page:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793
But that page redirects to:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1
So when I try to fetch the data with open from OpenUri, it raises a RuntimeError saying HTTP redirection loop.
I'm not sure how to get the data once it redirects and throws that error.
Answer 0 (score: 23)
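You need a tool like Mechanize. From its description:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
That is exactly what you need. So: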
sudo gem install mechanize
Then:
require 'mechanize'
# Older releases of the gem used WWW::Mechanize.new here
agent = Mechanize.new
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793"
page.content # Get the resulting page as a string
page.body # Same string; content is an alias for body
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri
And you're ready to rock.
Answer 1 (score: 1)
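The site seems to be using sessions for some of its redirect logic. If you don't send back the session cookies it set on the first request, you end up in a redirect loop. IMHO this is a poor implementation on their part.
However, I tried passing the cookies back and couldn't get it to work, so I can't be completely sure that this is all that is going on.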
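For anyone who wants to test the cookie theory without Mechanize, here is a minimal sketch that follows redirects by hand with Net::HTTP and sends back every cookie received so far. The helper names store_cookies, cookie_header and fetch_with_cookies are made up for this example, the cookie handling ignores attributes such as Path and Expires, and it has not been verified against the Canada Post site, so treat it as a starting point rather than a confirmed fix.

```ruby
require "net/http"
require "uri"

# Merge the name=value pairs from Set-Cookie header values into a jar (a Hash).
# Cookie attributes (Path, Expires, ...) are ignored in this simplified sketch.
def store_cookies(jar, set_cookie_values)
  Array(set_cookie_values).each do |value|
    pair = value.to_s.split(";", 2).first.to_s.strip
    name, val = pair.split("=", 2)
    jar[name] = val.to_s unless name.nil? || name.empty?
  end
  jar
end

# Serialize the jar into a Cookie request header value.
def cookie_header(jar)
  jar.map { |name, val| "#{name}=#{val}" }.join("; ")
end

# Follow redirects manually, replaying the collected cookies on each request.
def fetch_with_cookies(url, limit: 10)
  uri = URI.parse(url)
  jar = {}
  limit.times do
    request = Net::HTTP::Get.new(uri)
    request["Cookie"] = cookie_header(jar) unless jar.empty?
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    store_cookies(jar, response.get_fields("Set-Cookie"))
    location = response["Location"]
    return response unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URI.
    uri = uri.merge(location)
  end
  raise "Too many redirects (limit: #{limit})"
end
```

Calling fetch_with_cookies with the tracking URL from the question and reading body on the returned response would show whether replaying the session cookie is enough to break the loop.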
Answer 2 (score: 1)
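While Mechanize is a great tool, I prefer to "cook" my own.
If you are serious about parsing, take a look at this code. It crawls thousands of sites a day on an international level, and as far as I have researched and tweaked, I haven't found a more stable approach that still lets you customize it heavily to your needs later on.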
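require "open-uri"
require "timeout"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
# CharGuess and Iconv, used below, come from separate libraries that must be loaded as well.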
def crawl(url_address)
  self.errors = Array.new
  begin
    begin
      url_address = URI.parse(url_address)
    rescue URI::InvalidURIError
      # Re-escape the address and retry. URI.decode and URI.encode were
      # removed in Ruby 3.0, so this fallback only works on older Rubies.
      url_address = URI.decode(url_address)
      url_address = URI.encode(url_address)
      url_address = URI.parse(url_address)
    end
    url_address.normalize!
    stream = ""
    # SHINSO_HEADERS is not defined in this snippet; presumably a hash of
    # request headers defined elsewhere in the application.
    Timeout.timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
    if stream.size > 0
      url_crawled = URI.parse(stream.base_uri.to_s)
    else
      self.errors << "Server said status 200 OK but document file is zero bytes."
      return
    end
  rescue StandardError => exception
    self.errors << exception
    return
  end
  # extract information before html parsing
  self.url_posted = url_address.to_s
  self.url_parsed = url_crawled.to_s
  self.url_host = url_crawled.host
  self.status = stream.status
  self.content_type = stream.content_type
  self.content_encoding = stream.content_encoding
  self.charset = stream.charset
  # content_encoding is an array of encoding names, e.g. ["gzip"]
  if stream.content_encoding.include?('gzip')
    document = Zlib::GzipReader.new(stream).read
  elsif stream.content_encoding.include?('deflate')
    # Inflate (not Deflate) decompresses a deflate-encoded body
    document = Zlib::Inflate.inflate(stream.read)
  # 'x-gzip' and 'compress' are not handled here
  else
    document = stream.read
  end
  self.charset_guess = CharGuess.guess(document)
  # Convert only when a charset was detected and it is not already UTF-8
  if !self.charset_guess.blank? && !%w[utf-8 utf8].include?(self.charset_guess.downcase)
    # Iconv.iconv returns an array of strings, so join it back into one string
    document = Iconv.iconv("UTF-8", self.charset_guess, document).join
  end
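  document = Nokogiri::HTML.parse(document,nil,"utf8")
  # drop all script elements
  document.xpath('//script').remove
  document.xpath('//SCRIPT').remove
  # rewrite every non-empty src attribute as an absolute address;
  # make_absolute_address is a helper that is not shown in this snippet
  document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]').each do |item|
    item.set_attribute('src',make_absolute_address(item['src']))
  end
  # strip HTML comments, including multi-line ones, then re-parse
  document = document.to_s.gsub(/<!--.*?-->/m,'')
  self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end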
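The decompression and charset steps in crawl can also be written with the standard library alone. The sketch below is an illustration, not the code that runs in the crawler above: decode_body and to_utf8 are hypothetical helper names, String#encode stands in for Iconv, and the deflate branch assumes the server sends zlib-wrapped data, which not every server does.

```ruby
require "zlib"
require "stringio"

# Decompress a raw HTTP body according to its Content-Encoding values
# (an array of strings such as ["gzip"]).
def decode_body(raw, content_encodings)
  encodings = Array(content_encodings).map { |e| e.to_s.downcase }
  if encodings.include?("gzip") || encodings.include?("x-gzip")
    Zlib::GzipReader.new(StringIO.new(raw)).read
  elsif encodings.include?("deflate")
    # Assumes a zlib-wrapped deflate stream
    Zlib::Inflate.inflate(raw)
  else
    raw
  end
end

# Re-encode text to UTF-8, replacing bytes that cannot be converted.
def to_utf8(text, source_charset)
  text.encode("UTF-8", source_charset, invalid: :replace, undef: :replace)
end
```

With these, the body would be passed through decode_body first and then through to_utf8 with whatever charset was detected, before handing the result to Nokogiri.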