How to bypass network errors when crawling the web with Ruby Mechanize

Time: 2018-02-21 15:07:10

Tags: ruby web-crawler mechanize rescue

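I am using the Ruby Mechanize web crawler to extract data from popular real estate websites. I use home addresses as keywords to scrape public data from Zillow, Redfin, and similar sites. I am basically trying to bypass any HTTP and network errors, but the rescue clauses below do not seem to do the job.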

def scrape_single(key_word)
  #setup agent
  agent = Mechanize.new{ |agent|
    agent.user_agent_alias = 'Mac Safari'
  }
  agent.ignore_bad_chunking = true
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
  agent.request_headers = { "Accept-Encoding" => ""}
  agent.follow_meta_refresh = true
  agent.keep_alive = false

  #page setup
  begin
    agent.get(@@search_engine) do |page|
      @@search_result = page.form('f') do |search|
        search.q = key_word
      end.submit
    end
  rescue Timeout::Error
    puts "Timeout"
    retry
  rescue Net::HTTPGatewayTimeOut => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPBadGateway  => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPNotFound => e
    if e.response_code == '404'
      e.skip
      sleep 5
    end
  rescue Net::HTTPFatalError => e
    if e.response_code == '503'
      e.skip
    end
  rescue Mechanize::ResponseCodeError => e
    if e.response_code == '404'
      e.skip
      sleep 5
    elsif e.response_code == '502'
      e.skip
      sleep 5
    else
      retry
    end
  rescue Errno::ETIMEDOUT
    retry
  end

  return @@search_result      # returns Mechanize::Page
end

Here is an example of the error message I get for a keyword that is an address in MA.

  

/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)

The actual message shown when I open the URL above in a browser is:

  

Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623

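My goal is simply to ignore and skip the occasional error and move on to the next keyword. I have not been able to find a working solution online, so any feedback would be much appreciated.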

1 answer:

Answer 0 (score: 1)

If I understand correctly, the error being raised is Mechanize::ResponseCodeError, and it clearly carries a 404 response_code. But your script does not end up rescuing the 404 response_code from Mechanize::ResponseCodeError:

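all_response_code = ['403', '404', '502']

begin
  # ... agent.get and form submit, as in the question ...
rescue Mechanize::ResponseCodeError => e
  # Read the code from the rescued error object (e.response_code).
  if all_response_code.include? e.response_code
    # As far as I know the error object has no skip method;
    # leaving the rescue block is what skips this keyword.
    sleep 5
  else
    retry
  end
end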
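
Two details in the question's rescue clauses are worth checking, and the small sketch below illustrates them using only Ruby's standard library (no Mechanize needed). It is an illustration, not part of the original answer; the variable names are made up for the example. First, Net::HTTPNotFound is a response class rather than an exception class, so a rescue clause naming it should never match. Second, `code == '504' || '502'` is always truthy, because `==` binds tighter than `||`.

```ruby
require 'net/http'

# Net::HTTPNotFound describes a response; it does not inherit from Exception,
# so `rescue Net::HTTPNotFound` cannot catch anything that is raised.
not_found_is_exception = Net::HTTPNotFound.ancestors.include?(Exception)

# Net::HTTPFatalError, by contrast, is a real exception class.
fatal_is_exception = Net::HTTPFatalError.ancestors.include?(Exception)

# `==` binds tighter than `||`, so this parses as (code == '504') || '502'
# and evaluates to the non-empty (truthy) string '502' for any other code.
code = '404'
loose_check = (code == '504' || '502')

# Checking membership in a list compares the code against each value.
strict_check = %w[502 504].include?(code)

puts "Net::HTTPNotFound is an exception:   #{not_found_is_exception}"
puts "Net::HTTPFatalError is an exception: #{fatal_is_exception}"
puts "loose check:  #{loose_check.inspect}"
puts "strict check: #{strict_check.inspect}"
```

This is consistent with the error message in the question, where the 404 arrives wrapped in a Mechanize::ResponseCodeError.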

Adding a condition for the 404 response_code like this may be all it takes.

Edit: I changed the code a little to reduce the number of lines.
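
One thing to watch with the `retry` in the else branch: if a site keeps returning the same unexpected code, the loop never ends. Below is a minimal sketch of a bounded variant that gives up after a few attempts and moves on to the next keyword. Everything named here is hypothetical: `StubResponseCodeError` is a stand-in so the sketch runs without the mechanize gem (in the real script you would pass Mechanize::ResponseCodeError as `error_class`), and `fetch_or_skip`, `SKIPPABLE_CODES`, `max_retries`, and `pause` are illustrative names, not Mechanize API.

```ruby
# Hypothetical stand-in for Mechanize::ResponseCodeError, exposing the same
# response_code reader so the sketch can run without the mechanize gem.
class StubResponseCodeError < StandardError
  attr_reader :response_code

  def initialize(response_code)
    @response_code = response_code
    super("#{response_code} -- unhandled response")
  end
end

SKIPPABLE_CODES = %w[403 404 502].freeze

# Yields the keyword to the block and returns the block's result.
# Returns nil when the keyword should be skipped: either the status code is
# in SKIPPABLE_CODES, or the block still fails after max_retries retries.
def fetch_or_skip(key_word, error_class:, max_retries: 3, pause: 5)
  attempts = 0
  begin
    yield key_word
  rescue error_class => e
    return nil if SKIPPABLE_CODES.include?(e.response_code)

    attempts += 1
    return nil if attempts > max_retries

    sleep pause
    retry
  end
end

# Usage with a fake fetcher: 'missing' always returns 404,
# 'flaky' fails twice with 503 and then succeeds.
calls = Hash.new(0)
results = %w[good missing flaky].map do |key_word|
  fetch_or_skip(key_word, error_class: StubResponseCodeError, pause: 0) do |kw|
    calls[kw] += 1
    raise StubResponseCodeError, '404' if kw == 'missing'
    raise StubResponseCodeError, '503' if kw == 'flaky' && calls[kw] < 3

    "page for #{kw}"
  end
end

p results
```

In the scraper, the block would hold the `agent.get` and form submit from the question, and the caller would drop the nil results. Timeouts such as Timeout::Error and Errno::ETIMEDOUT are not covered by this sketch and would need their own bounded rescue.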