Mechanize won't connect to a website

Time: 2016-05-21 16:13:17

Tags: ruby-on-rails ruby rubygems mechanize mechanize-ruby

Hi, I have a problem: the Mechanize gem cannot connect to a website. The gem is installed. Code:

require 'mechanize'

agent = Mechanize.new
main_page = agent.get 'https://imbd.com'
main_page.link_with(text: "Top 250").click
rows = list_page.root.css(".lister-list tr")

puts rows.size

This is the error:

C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `initialize': A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "imbd.com" port 80 (Errno::ETIMEDOUT)
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `open'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:879:in `block in connect'
    from C:/Ruby/lib/ruby/2.2.0/timeout.rb:73:in `timeout'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:878:in `connect'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:863:in `do_start'
    from C:/Ruby/lib/ruby/2.2.0/net/http.rb:858:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:631:in `connection_for'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:267:in `fetch'
    from C:/Ruby/lib/ruby/gems/2.2.0/gems/mechanize-2.7.4/lib/mechanize.rb:464:in `get'
    from C:/Ruby/Workspace/imbd.rb:4:in `<main>'

Does anyone know what's wrong? Thanks!

2 answers:

Answer 0: (score: 0)

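After looking at IMDb, I found that they run a lot of JavaScript, which trips up Mechanize because it cannot parse JS and make sense of the incoming response. If you want to scrape content or automate browsing, I'd suggest Capybara instead of Mechanize. Combining Capybara with Poltergeist (you'll need to install PhantomJS for this approach) will work better than Mechanize and can automatically interact with pages that load a lot of JS.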

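I've added an approach that may resolve the error for you. If it works, that's because Mechanize was trying to fetch the page before the JS scripts had finished and so couldn't get valid data.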

Edit:

  require 'mechanize'

  agent = Mechanize.new
  agent.read_timeout = 3  # read timeout for the agent, in seconds
  begin
    main_page = agent.get 'https://imbd.com'
    # keep the page returned by the click so it can be queried below
    list_page = main_page.link_with(text: "Top 250").click
    rows = list_page.root.css(".lister-list tr")
  rescue Timeout::Error
    puts "Timeout!"
    puts "read_timeout attribute is set to #{agent.read_timeout}s" unless agent.read_timeout.nil?
  end
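
The trace in the question fails inside `connect`, that is, while the connection is being opened rather than while a response is being read, so it may help to see the two timeouts side by side. The following is only a sketch using the standard library's Net::HTTP (which the trace shows underneath Mechanize); `fetch_status` is a hypothetical helper name, and the symbols it returns are made up for illustration. Mechanize exposes `open_timeout` and `read_timeout` attributes of its own with the same meaning.

```ruby
require 'net/http'

# Hypothetical helper: issue a GET for "/" and report which phase timed out.
# Returns the HTTP status code as a String on success, or a Symbol on timeout.
def fetch_status(host, port, open_timeout: 5, read_timeout: 5)
  http = Net::HTTP.new(host, port)
  http.open_timeout = open_timeout   # seconds allowed for opening the connection
  http.read_timeout = read_timeout   # seconds allowed for each read of the response
  # Newer Rubies retry idempotent requests once; turn that off so the timeout is not doubled.
  http.max_retries = 0 if http.respond_to?(:max_retries=)
  http.get('/').code
rescue Net::OpenTimeout, Errno::ETIMEDOUT
  # The connection itself never opened (the situation in the question's trace).
  :connect_timeout
rescue Net::ReadTimeout
  # The connection opened, but the server did not answer in time.
  :read_timeout
end
```

A `:connect_timeout` result points at the host or the network rather than at the page content, which is a different failure from a slow response.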

Answer 1: (score: 0)

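While it's true that Mechanize doesn't support JavaScript, the problem is that the site you're trying to reach doesn't exist. You're trying to access www.imbd.com instead of www.imdb.com. So the error message is accurate.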
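
One way to catch this kind of mistake early is to check that the host accepts a TCP connection at all before handing the URL to Mechanize. The sketch below uses only the standard library; `probe_host` is a hypothetical helper, and the symbols it returns are illustrative rather than part of any library.

```ruby
require 'socket'

# Hypothetical helper: try to open a TCP connection and classify the outcome.
def probe_host(host, port, timeout: 5)
  # With a block, Socket.tcp closes the socket afterwards and returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { |_sock| :ok }
rescue SocketError
  # Name resolution failed (for example, the domain does not exist).
  :unresolvable
rescue Errno::ETIMEDOUT
  # The name resolved, but nothing answered within the timeout.
  :timed_out
rescue SystemCallError
  # Any other OS-level failure, such as a refused connection.
  :unreachable
end
```

Calling `probe_host('imbd.com', 80)` before `agent.get` would separate a hostname or network problem from anything Mechanize does with the page afterwards.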

FWIW, IMDb doesn't want you to scrape their site:

  

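Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent.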