Web scraping: form input has no effect in Rails

Time: 2016-06-06 11:01:31

Tags: ruby-on-rails ruby web-scraping screen-scraping

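We are building a website that presents data from the UCAS website about universities and courses. We are trying to restrict the results to Scottish universities only, but the code below doesn't seem to work. `Location` in the form is the name of the input ID for that part of the form on the UCAS site, but the results still show all universities.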

require 'mechanize'

class PagesController < ApplicationController
  def home
    mechanize = Mechanize.new
    @uninames_array = []

    page = mechanize.get('http://search.ucas.com/')

    # Fill in the search form; the keys are the input names on the UCAS site.
    form = page.forms.first
    form['Vac'] = '2'
    form['AvailableIn'] = '2016'
    form['Location'] = 'scotland'
    page = form.submit

    # Collect the names from the first results page.
    page.search('li.result h3').each do |h3|
      @uninames_array.push(h3.text)
    end

    # Follow the ">" pager link until there are no more pages.
    while (next_page_link = page.at('.pager a[text()=">"]'))
      page = mechanize.get(next_page_link['href'])

      page.search('li.result h3').each do |h3|
        @uninames_array.push(h3.text)
      end
    end
  end
end

1 answer:

Answer 0 (score: 0)

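It seems that the CountryCode variable in the page is initialized in JavaScript. That is why the request does not return the expected results.

Mechanize cannot handle a JavaScript environment, but you can send the search as a GET request instead. You have to specify all the parameters yourself, including CountryCode.

Example: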

require 'mechanize'
mechanize = Mechanize.new

@uninames_array = []

#page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')
page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')


page.search('li.result h3').each do |h3|
  name = h3.text
  @uninames_array.push(name)
end

while next_page_link = page.at('.pager a[text()=">"]')
  page = mechanize.get(next_page_link['href'])

  page.search('li.result h3').each do |h3|
    name = h3.text
    @uninames_array.push(name)
  end
end

puts @uninames_array.to_s
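The long query string above can also be assembled from a hash, which makes it easier to see which parameter changes between searches. The following is only a sketch: `provider_search_url` and `SEARCH_ENDPOINT` are hypothetical names introduced here, and the parameter list is simply copied from the example URL, so it assumes the site still expects exactly those keys.

```ruby
require 'uri'

# Endpoint taken from the example URL above.
SEARCH_ENDPOINT = 'http://search.ucas.com/search/providers'

# Hypothetical helper: builds the search URL with every parameter the
# example URL carries. Empty values are sent as "Key=" like in the example.
def provider_search_url(location:, country_code:, vac: '2', available_in: '2016')
  params = {
    'CountryCode'        => country_code.to_s,
    'RegionCode'         => '',
    'Lat'                => '',
    'Lng'                => '',
    'Feather'            => '',
    'Vac'                => vac.to_s,
    'Query'              => '',
    'ProviderQuery'      => '',
    'AcpId'              => '',
    'Location'           => location.to_s,
    'IsFeatherProcessed' => 'True',
    'SubjectCode'        => '',
    'AvailableIn'        => available_in.to_s
  }
  # URI.encode_www_form keeps the hash order and percent-encodes the values.
  "#{SEARCH_ENDPOINT}?#{URI.encode_www_form(params)}"
end

url = provider_search_url(location: 'scotland', country_code: 3)
```

The resulting `url` can be passed to `mechanize.get` in place of the hard-coded string.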

If you need access to the data for all countries, there is a JavaScript block in the page that contains them:

var countries = [],
regions, geoCordinates;
countries.england = 1;
countries.wales = 2;
countries.scotland = 3;
countries["northern ireland"] = 4;
countries.ni = 4;
countries.ireland = 4;
countries.uk = "1|2|3|4|5";
countries["united kingdom"] = "1|2|3|4|5";
regions = [];
regions["central scotland"] = 301;
regions["channel isles"] = 901;
regions["channel islands"] = 901;
regions["dumfries and galloway"] = 302;
regions["east midlands"] = 101;
regions["east england"] = 102;
regions["east sussex"] = 111;
regions["east wales"] = 201;
regions.fife = 303;
regions.grampian = 304;
regions["isle man"] = 902;
regions.london = 103;
regions.lothian = 305;
regions["mid wales"] = 202;
regions["north east"] = 104;
regions["north east england"] = 104;
regions["north wales"] = 203;
regions["north west"] = 105;
regions["north west england"] = 105;
regions.orkney = 306;
regions["scottish borders"] = 307;
regions["scottish highlands"] = 308;
regions["shetland islands"] = 309;
regions["south east"] = 106;
regions["south east england"] = 106;
regions["south east wales"] = 204;
regions["south wales"] = 205;
regions["south west"] = 107;
regions["south west england"] = 107;
regions.strathclyde = 310;
regions.tayside = 311;
regions["west midlands"] = 108;
regions["west sussex"] = 112;
regions["west wales"] = 206;
regions["yorkshire and humber"] = 109;
regions["yorkshire and the humber"] = 109;
regions.yorkshire = 109;
regions.bedfordshire = 114;
regions.essex = 10201;
regions.kent = 10601;
regions.hampshire = 10602;
regions.cornwall = 10701;
regions["north yorkshire"] = 10901;
regions.midlands = "101|108";
regions.sussex = "111|112";
regions["north england"] = "104|105|109";
regions["northern england"] = "104|105|109";
regions["south england"] = "102|103|106|107|114";
regions["southern england"] = "102|103|106|107|114";
geoCordinates = [];
geoCordinates.jordanstown = "54.68627,-5.88206,0"
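If the scraper has to pick the code itself, the country entries from that script can be mirrored in a Ruby hash. This is a sketch under assumptions: `COUNTRY_CODES` and `country_code_for` are hypothetical names, the values are copied by hand from the JavaScript above, and lowercasing and trimming the input is a guess at how the site normalizes the location text.

```ruby
# Country codes copied from the page's JavaScript shown above.
COUNTRY_CODES = {
  'england'          => '1',
  'wales'            => '2',
  'scotland'         => '3',
  'northern ireland' => '4',
  'ni'               => '4',
  'ireland'          => '4',
  'uk'               => '1|2|3|4|5',
  'united kingdom'   => '1|2|3|4|5'
}.freeze

# Hypothetical helper: returns the CountryCode value for a location string,
# or nil when the location is not one of the known countries.
def country_code_for(location)
  COUNTRY_CODES[location.to_s.strip.downcase]
end

scotland_code = country_code_for('Scotland')
```

A `nil` result means the location is not a country, in which case the region table from the same script would be the place to look.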