Scraping data with Ruby Mechanize

Time: 2014-08-28 07:24:21

Tags: ruby nokogiri mechanize-ruby

I am scraping data from http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53

Here is the code I tried:

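require 'mechanize'

uri = "http://www.mca.gov.in/DCAPortalWeb/dca/MyMCALogin.do?method=setDefaultProperty&mode=53"
#html, html_content = @mobj.get_data(uri)

agent = Mechanize.new
html_page = agent.get(uri)
html_form = html_page.form
# Select the radio button named 'search' whose value is '2'
html_form.radiobuttons_with(:name => 'search', :value => '2')[0].check
# submit returns the response page; print that rather than the original page
result_page = html_form.submit
puts result_page.content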

Error:

var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 500 => Net::HTTPInternalServerError for http://www.mca.gov.in/DCAPortalWeb/dca/ProsecutionDetailsSRAction.do -- unhandled response (Mechanize::ResponseCodeError)
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:1281:in `post_form'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize.rb:548:in `submit'
from /var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/form.rb:223:in `submit'
from ministry_corp_aff.rb:32:in `start'
from ministry_corp_aff.rb:52:in `<main>'

If I manually click the 3rd radio button and then submit, I get a .zip file. I am trying to get the data from the .xls file inside that zip.

1 Answer:

Answer 0 (score: 0)

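The radio button has an onclick event handler that triggers some JavaScript. Clicking the submit <a> tag also causes some JavaScript to run. That JavaScript probably sets some values that are sent back with the form and that the server checks.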
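
To make that concrete, here is a minimal sketch using only Ruby's standard library. The field name `selectedOption` and the value `byState` are made up for illustration; the real form's hidden fields are not shown in the question. It compares the POST body a client that skips JavaScript would send with the body a browser would send after the onclick handler has run.

```ruby
require 'uri'

# Hypothetical fields: 'selectedOption' stands in for whatever hidden input
# the page's JavaScript fills in. It is NOT taken from the real form.
# Without JavaScript the hidden field keeps its default (empty) value.
STATIC_FIELDS = { 'search' => '2', 'selectedOption' => '' }.freeze

# A browser would have run the onclick handler, which sets the field.
BROWSER_FIELDS = STATIC_FIELDS.merge('selectedOption' => 'byState').freeze

# Encode a hash of fields as an application/x-www-form-urlencoded body.
def post_body(fields)
  URI.encode_www_form(fields)
end

puts post_body(STATIC_FIELDS)   # search=2&selectedOption=
puts post_body(BROWSER_FIELDS)  # search=2&selectedOption=byState
```

If the server validates a field like this, the first request could plausibly be rejected, which would be consistent with the 500 response in the traceback.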

Mechanize cannot execute JavaScript. You need Selenium WebDriver.
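
A rough sketch of what that could look like with the selenium-webdriver gem and Firefox follows. Several things here are assumptions rather than facts about the site: `submit_selector` is a hypothetical CSS selector for the submit <a> tag (the question does not show its markup), the radio button name and value are copied from the question's code, and the download is assumed to be served as `application/zip`. The helper names `radio_selector` and `download_zip` are made up for this example.

```ruby
require 'tmpdir'

# Builds a CSS selector for a radio button with the given name and value.
def radio_selector(name, value)
  "input[type='radio'][name='#{name}'][value='#{value}']"
end

# Drives a real browser so the page's onclick handlers actually run.
# Returns the path of the downloaded .zip file.
# submit_selector is a hypothetical selector for the submit <a> tag.
def download_zip(uri, submit_selector, download_dir = Dir.mktmpdir)
  # Loaded here so the helper above can be used without a browser stack.
  # Requires the selenium-webdriver gem, Firefox, and geckodriver.
  require 'selenium-webdriver'

  options = Selenium::WebDriver::Firefox::Options.new
  # Save downloads to download_dir without showing a dialog.
  options.add_preference('browser.download.folderList', 2)
  options.add_preference('browser.download.dir', download_dir)
  # Assumption: the server sends the archive as application/zip.
  options.add_preference('browser.helperApps.neverAsk.saveToDisk', 'application/zip')

  driver = Selenium::WebDriver.for(:firefox, options: options)
  begin
    driver.get(uri)
    driver.find_element(css: radio_selector('search', '2')).click
    driver.find_element(css: submit_selector).click

    # Wait until a .zip exists and Firefox has no partial download left.
    wait = Selenium::WebDriver::Wait.new(timeout: 60)
    wait.until do
      parts = Dir.glob(File.join(download_dir, '*.part'))
      Dir.glob(File.join(download_dir, '*.zip')).first if parts.empty?
    end
  ensure
    driver.quit
  end
end
```

It would be called as `download_zip(uri, 'a#submitLink')`, where `a#submitLink` is again only a placeholder for the real link's selector. The .xls file can then be extracted from the returned zip path with a zip library of your choice.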