Ruby - Web scraping - (OpenURI::HTTPError)

Time: 2012-10-04 21:23:34

Tags: ruby web web-scraping

I am trying to write a simple web scraper in Ruby. It works up to the 29th URL, and then I get this error message:

C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:346:in `open_http': 500 Internal Server Error (OpenURI::HTTPError)
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:775:in `buffer_open'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:677:in `open'
        from C:/Ruby193/lib/ruby/1.9.1/open-uri.rb:33:in `open'
        from test.rb:24:in `block (2 levels) in <main>'
        from test.rb:18:in `each'
        from test.rb:18:in `block in <main>'
        from test.rb:14:in `each'
        from test.rb:14:in `<main>'

My code:

require 'rubygems'  
require 'nokogiri'  
require 'open-uri'  

aFile=File.new('data.txt', 'w')

ag = 0
  for i in 1..40 do
    agenzie = ag + 1

    #change url parameter 

    url = "http://www.infotrav.it/dettaglio.do?sort=*RICOVIAGGI*&codAgenzia=" + "#{ ag }"  
    doc = Nokogiri::HTML(open(url))
    aFile=File.open('data.txt', 'a')
    aFile.write(doc.at_css("table").text)
    aFile.close
  end

Do you have any idea how to fix it? Thanks!

AS

3 answers:

Answer 0 (score: 4)

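The code has a small typo. It should be ag = ag + 1 instead of agenzie = ag + 1. I assume this happened while you were copying the code to Stack Overflow, because the code would not work with the typo.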

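I was able to run the code locally and got the same error. It turns out that the URL being accessed does not exist on the website http://www.infotrav.it (when codAgenzia = 30); the server returns HTTP error 500.

So the problem is not in your code but on the remote server (http://www.infotrav.it).

As slivu mentioned in his answer, you should rescue the error and continue scraping.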

Answer 1 (score: 3)

If you cannot fix the problem on the remote server, try rescuing the error and continuing to scrape:

begin
  doc = Nokogiri::HTML(open(url))
  aFile=File.open('data.txt', 'a')
  aFile.write(doc.at_css("table").text)
  aFile.close
rescue => e
  puts e.message
end
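
A narrower variant of the rescue above, as a minimal sketch: it rescues only OpenURI::HTTPError, so unrelated bugs still surface instead of being silently printed. The helper name fetch_or_skip and its fetcher argument are hypothetical and not part of the original code; the fetcher is passed in so the helper can be tried without network access. Note that on Ruby 2.5 and later URI.open takes the place of the bare open(url) used in the question.

```ruby
require 'open-uri'

# Hypothetical helper (not part of the original code): runs the fetcher
# for one URL and returns its result, or nil when the server answers
# with an HTTP error such as 500.
def fetch_or_skip(url, fetcher)
  fetcher.call(url)
rescue OpenURI::HTTPError => e
  # Report the failing URL on stderr and let the caller move on
  warn "Skipping #{url}: #{e.message}"
  nil
end

# Assumed usage inside the scraping loop (requires network access):
#   response = fetch_or_skip(url, ->(u) { URI.open(u) })
#   next unless response
```

Returning nil keeps the loop body simple: the caller skips that URL and carries on with the next one.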

Answer 2 (score: 3)

Here, let me clean that up for you:

File.open('data.txt', 'w') do |aFile|
  (1..40).each do |ag|
    url = "http://www.infotrav.it/dettaglio.do?sort=*RICOVIAGGI*&codAgenzia=#{ag}"
    response = open(url) rescue nil
    next unless response
    doc = Nokogiri::HTML(response)
    aFile << doc.at_css("table").text
  end
end

Notes:

  • Using the block form of File.open means the file closes itself when the block exits
  • Using each to iterate instead of a for loop
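
The first note can be checked with a small sketch that needs no network access. The temporary directory, the handle variable, and the "agency N" lines are assumptions made for illustration only; they stand in for the scraped table text.

```ruby
require 'tmpdir'

# Write into a throwaway directory so the real data.txt is not touched
path = File.join(Dir.mktmpdir, 'data.txt')
handle = nil

File.open(path, 'w') do |file|
  handle = file                          # keep a reference to inspect later
  (1..3).each { |ag| file << "agency #{ag}\n" }
end

# The block has exited, so File.open has already closed the file
puts handle.closed?
```

Because the file is closed by File.open itself, no explicit close call is needed, even if the block raises.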