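Web scraping with Mechanize in Ruby returns different HTML from the browser

Time: 2013-12-04 21:19:37

Tags: ruby-on-rails ruby web-scraping mechanize

I'm relatively new to Ruby and Mechanize, and I'm having some difficulty with an ASP.NET site.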

URL: http://www.adecco.co.uk/careercentre/job-search-results.aspx?kws=&pstc=&cty=&prvnm=&pdx=1


require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.adecco.co.uk/careercentre/job-search-results.aspx?kws=&pstc=&cty=&prvnm=&pdx=1')
puts page.body

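I'm working through the examples on the Mechanize site, but the HTML I get back through Mechanize is very different from what I see in the browser with View Source. Do I need to do something to get the complete HTML?

Update

I'm not quite sure what to do with this question. The problem was actually that the page renders its content later using jQuery, which Mechanize does not execute, so I ended up using Selenium to get the page with the correct HTML. Neither answer is actually wrong, so I have upvoted both, but neither actually solved the problem.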
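
For anyone landing here with the same problem, this is roughly the idea behind the Selenium approach, as a minimal sketch rather than the exact code I used. The method name `rendered_html` and its `marker`, `timeout` and `interval` parameters are made up for illustration. It takes any driver object that responds to `get` and `page_source` (a Selenium::WebDriver driver does) and polls until some text that only appears after the JavaScript has run shows up in the page source.

```ruby
# Poll a browser driver until the rendered page contains `marker`, then
# return the rendered HTML. `driver` is any object that responds to
# #get(url) and #page_source, such as a Selenium::WebDriver driver.
# The method name and keyword arguments are hypothetical.
def rendered_html(driver, url, marker:, timeout: 10, interval: 0.5)
  driver.get(url)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    html = driver.page_source
    # The marker should be text that is only present once the scripts ran.
    return html if html.include?(marker)

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "Timed out after #{timeout}s waiting for #{marker.inspect}"
    end

    sleep interval
  end
end

# Usage with a real browser. This assumes the selenium-webdriver gem and a
# matching browser driver are installed; the marker text is a placeholder.
#
#   require 'selenium-webdriver'
#   driver = Selenium::WebDriver.for(:firefox)
#   begin
#     puts rendered_html(driver, url, marker: 'text from a job listing')
#   ensure
#     driver.quit
#   end
```

Passing the driver in keeps the waiting logic independent of the browser, so it can be tried out with a stub object before a real browser is involved.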

Thanks

Mark

2 answers:

Answer 0: (score: 3)

Please try the following code

require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
page = agent.get('http://www.adecco.co.uk/careercentre/job-search-results.aspx?kws=&pstc=&cty=&prvnm=&pdx=1')

document = Nokogiri::HTML(page.content)
puts document

Answer 1: (score: 2)

I think this is because the site handles different user agents differently. You can set the user agent to match a browser, like this:

require 'mechanize'

a = Mechanize.new
a.user_agent_alias = 'Mac Safari'

You can use any of these values:
AGENT_ALIASES = {
  'Windows IE 6' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
  'Windows IE 7' => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
  'Windows Mozilla' => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
  'Mac Safari' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; de-at) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10',
  'Mac FireFox' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6',
  'Mac Mozilla' => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4a) Gecko/20030401',
  'Linux Mozilla' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
  'Linux Firefox' => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.1) Gecko/20100122 firefox/3.6.1',
  'Linux Konqueror' => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',
  'iPhone' => 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C28 Safari/419.3',
  'Mechanize' => "WWW-Mechanize/#{VERSION} (http://rubyforge.org/projects/mechanize/)"
}

The list above comes from https://github.com/sparklemotion/mechanize/blob/master/lib/mechanize.rb#L115
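
If you want to check whether the user agent really changes what the server sends, one option is to fetch the page twice with different User-Agent headers and compare the bodies. Below is a rough sketch using Net::HTTP from the standard library instead of Mechanize, so it has no gem dependencies. The helper names `get_request_as` and `fetch_body_as` are made up for this example.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that identifies itself with the given User-Agent.
# (Hypothetical helper name, not part of Mechanize or Net::HTTP.)
def get_request_as(url, user_agent)
  request = Net::HTTP::Get.new(URI(url))
  request['User-Agent'] = user_agent
  request
end

# Send the request and return the response body. Needs network access.
def fetch_body_as(url, user_agent)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(get_request_as(url, user_agent)).body
  end
end

# Example comparison (not run here):
#
#   url    = 'http://www.adecco.co.uk/careercentre/job-search-results.aspx?kws=&pstc=&cty=&prvnm=&pdx=1'
#   safari = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; de-at) AppleWebKit/531.21.8 (KHTML, like Gecko) Version/4.0.4 Safari/531.21.10'
#   puts fetch_body_as(url, 'Ruby') == fetch_body_as(url, safari)
```

If the two bodies match, the user agent is probably not the cause, and the difference from the browser is more likely content added by JavaScript after the page loads.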