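I'm using the Ruby Mechanize web crawler to extract data from popular real estate websites. I use home addresses as keywords to scrape public data from Zillow, Redfin, and so on. I'm basically trying to get past any HTTP and network errors. The rescue clauses below don't seem to do the job.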

    def scrape_single(key_word)
      #setup agent
      agent = Mechanize.new{ |agent|
        agent.user_agent_alias = 'Mac Safari'
      }
      agent.ignore_bad_chunking = true
      agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
      agent.request_headers = { "Accept-Encoding" => ""}
      agent.follow_meta_refresh = true
      agent.keep_alive = false
      #page setup
      begin
        agent.get(@@search_engine) do |page|
          @@search_result = page.form('f') do |search|
            search.q = key_word
          end.submit
        end
      rescue Timeout::Error
        puts "Timeout"
        retry
      rescue Net::HTTPGatewayTimeOut => e
        if e.response_code == '504' || '502'
          e.skip
          sleep 5
        end
      rescue Net::HTTPBadGateway => e
        if e.response_code == '504' || '502'
          e.skip
          sleep 5
        end
      rescue Net::HTTPNotFound => e
        if e.response_code == '404'
          e.skip
          sleep 5
        end
      rescue Net::HTTPFatalError => e
        if e.response_code == '503'
          e.skip
        end
      rescue Mechanize::ResponseCodeError => e
        if e.response_code == '404'
          e.skip
          sleep 5
        elsif e.response_code == '502'
          e.skip
          sleep 5
        else
          retry
        end
      rescue Errno::ETIMEDOUT
        retry
      end
      return @@search_result # returns Mechanize::Page
    end

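Here is an example of the error message I get for an address keyword in MA:

    /home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)

The actual message shown when entering the above URL in a browser is:

    Cannot GET /MA/WASHINGTON/306-WERDEN-RD-Unknown/home/134059623

My goal is simply to ignore and skip occasional errors and move on to the next keyword. I couldn't find a working solution online, so any feedback would be greatly appreciated.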
Answer 0 (score: 1)
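If I understand correctly, the error raised is Mechanize::ResponseCodeError, which here clearly carries a 404 response_code. But in your script you don't rescue the 404 response_code from Mechanize::ResponseCodeError:

    all_response_code = ['403', '404', '502']

    # ... inside the existing begin block ...
    rescue Mechanize::ResponseCodeError => e
      if all_response_code.include?(e.response_code)
        # Known status code: pause, then leave the rescue block to skip this keyword.
        sleep 5
      else
        retry
      end
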
Maybe if you add a condition for the 404 response_code, that will do the trick.
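
A related detail in the question's code: a condition such as `e.response_code == '504' || '502'` is always truthy in Ruby, because `||` returns the string `'502'` whenever the comparison on its left is false. A minimal sketch of the difference, where `buggy_check` and `gateway_error?` are hypothetical helper names used only for illustration and are not part of Mechanize:

```ruby
# Mirrors the condition from the question: the right-hand side of `||`
# is a bare string, so the expression is truthy for every input.
def buggy_check(code)
  code == '504' || '502'
end

# Compares the code against each candidate value instead.
def gateway_error?(code)
  %w[502 504].include?(code.to_s)
end
```

With these definitions, `buggy_check('200')` yields `'502'` (truthy), while `gateway_error?('200')` yields `false`.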
Edit: To reduce the number of lines, I changed the code a little.
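
For the stated goal of skipping occasional failures and moving on to the next keyword, one possible shape is a small wrapper with a bounded retry, so that an unlimited `retry` cannot loop forever on a page that keeps failing. This is only a sketch under assumptions: the helper name `fetch_or_skip`, the `SKIP_CODES` list, the attempt limit, and the wait time are all made up for illustration. The error class is passed in as an argument so the sketch runs without the gem; in the scraper it would presumably be `Mechanize::ResponseCodeError`.

```ruby
require 'timeout'

# Assumption: status codes treated as permanent failures for a keyword.
SKIP_CODES = %w[403 404 502 503 504].freeze

# Hypothetical helper: runs the block and returns its value, or nil when
# the keyword should be skipped. `error_class` must be an exception class
# whose instances respond to #response_code.
def fetch_or_skip(error_class, max_attempts: 3, wait: 5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue error_class => e
    # Listed status codes are skipped immediately, without retrying.
    return nil if SKIP_CODES.include?(e.response_code.to_s)
    # Any other status code is retried until the attempt limit is reached.
    return nil if attempts >= max_attempts
    sleep wait
    retry
  rescue Timeout::Error, Errno::ETIMEDOUT
    # Network timeouts are retried the same way, then skipped.
    return nil if attempts >= max_attempts
    sleep wait
    retry
  end
end

# Possible use in the scraper (assumes the mechanize gem is loaded and
# that `agent` and `url` exist):
#   page = fetch_or_skip(Mechanize::ResponseCodeError) { agent.get(url) }
#   next if page.nil?   # skip this keyword
```

Returning `nil` rather than re-raising keeps the calling loop simple: the caller only has to check for `nil` before parsing the page.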