Ruby Mechanize can't see all the content on LinkedIn

Time: 2014-06-29 15:15:04

Tags: ruby-on-rails ruby web-scraping mechanize

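I've installed the mechanize gem in my Rails app, and to test it I simply copied and pasted the code below into the irb console. It logs in, and I can put "Orange" into the search field and submit it, but the resulting page contains nothing related to "Orange", and none of the Orange employees I see in the browser appear either. Does LinkedIn have some security feature that prevents this, or am I doing something wrong?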

    require 'rubygems'
    require 'mechanize'
    require 'nokogiri'
    require 'open-uri'

    # create agent
    agent = Mechanize.new { |a|
      a.user_agent_alias = 'Mac Safari 4'
    }
    agent.follow_meta_refresh = true
    # visit page
    page = agent.get("https://www.linkedin.com/")

    # login
    login_form = page.form('login')
    login_form.session_key = "email"
    login_form.session_password = "password"
    page = agent.submit(login_form, login_form.buttons.first)

    # get the form
    form = agent.page.form_with(:name => "commonSearch")
    # fill form out
    form.keywords = 'Orange France'
    # get the button you want from the form
    button = form.button_with(:value => "Search")
    # submit the form using that button
    agent.submit(form, button)

    agent.page.link_with(:text => "Orange")
    # => nil

1 answer:

Answer 0 (score: 1)

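The problem is that Mechanize does not execute JavaScript, so it can't directly work with content that is loaded by JavaScript, which is the case for LinkedIn search results in this scenario.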
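One rough way to check whether this is what is happening on a given page is to compare the raw response body with the link text Mechanize can see. The following is only a minimal sketch: `locate_term` is a hypothetical helper (not part of Mechanize), and it uses a crude regular expression over the HTML rather than a real parser, so treat its answer as a hint.

```ruby
# Hypothetical helper (not a Mechanize API): reports where a term appears
# in a raw HTML string. The anchor regex is deliberately crude.
def locate_term(html, term)
  # Collect the inner text of every <a>...</a> element.
  anchors = html.scan(%r{<a\b[^>]*>(.*?)</a>}mi).flatten
  {
    in_raw_body: html.include?(term),
    in_link_text: anchors.any? { |text| text.include?(term) }
  }
end
```

With the agent from the question, this could be called as `locate_term(agent.page.body, "Orange")`. If the term is in the raw body but not in any link text, the data is probably embedded in the page as script or JSON; if it is in neither, it is likely fetched later by JavaScript.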

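The solution is to look at the body of the page, use a regular expression to pull out the content you need, and then parse the result as JSON.

For example: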

    require 'json'

    # `agent` is the logged-in Mechanize agent from the question
    url = "http://www.linkedin.com/vsearch/p?type=people&keywords=dario+barrionuevo"

    # each match is the JSON text of one {"person": {...}} object embedded in the page
    results = agent.get(url).body.scan(/\{"person":\{.*?\}\}/)

    person = results.first # You'd use an each here, but for the example we'll get the first

    json = JSON.parse(person)
    json['person']['firstName'] # => 'Dario'
    json['person']['lastName'] # => 'Barrionuevo'
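To handle every result rather than just the first, the same idea can be wrapped in a loop. This is a sketch under assumptions: `extract_people` and `PERSON_PATTERN` are hypothetical names introduced here, and the page is assumed to embed people as `{"person":{...}}` fragments as in the example above. The non-greedy regex stops at the first `}}`, so a person object that contains nested objects may be cut short; such fragments fail to parse and are skipped here.

```ruby
require 'json'

# Hypothetical pattern, same shape as the regex in the example above.
PERSON_PATTERN = /\{"person":\{.*?\}\}/

# Hypothetical helper: returns the parsed "person" hash for every fragment
# in the body that is valid JSON, silently skipping the ones that are not.
def extract_people(body)
  people = body.scan(PERSON_PATTERN).map do |fragment|
    begin
      JSON.parse(fragment)['person']
    rescue JSON::ParserError
      nil # fragment was truncated by the non-greedy match
    end
  end
  people.compact
end
```

With the agent from the question, usage might look like `extract_people(agent.get(url).body).each { |p| puts "#{p['firstName']} #{p['lastName']}" }`.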