Trouble scraping Google Trends with Capybara and Poltergeist

Time: 2014-12-08 21:21:06

Tags: ruby xpath capybara screen-scraping poltergeist

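I want to fetch the top trending queries for a specific category in Google Trends. I can download a CSV for the category, but that is not a viable solution because I want to branch into each query and find its trending sub-queries.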

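I am unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also, for some strange reason, taking a screenshot with Capybara returns a dark image.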

<div id="TOP_QUERIES_0_0table" class="trends-table">

Please run the code in a Ruby console to see whether it works. Capturing elements and screenshots works for facebook.com or google.com, but not for Trends.

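I am guessing this has to do with the table being generated dynamically on page load, but I am not sure whether that should prevent Capybara from capturing elements that have already loaded on the page. Any hints would be very valuable.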
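To illustrate the timing concern with dynamically generated content, here is a minimal sketch of polling until a condition holds. The `wait_until` helper and the simulated "page" are hypothetical and written in plain Ruby; they are not part of Capybara or Poltergeist, and only show the general retry idea.

```ruby
# Hypothetical helper (not a Capybara API): re-run the block until it
# returns a truthy value or the timeout (in seconds) elapses.
def wait_until(timeout: 3, interval: 0.1)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result              # condition met: hand back the value
    return nil if Time.now >= deadline   # give up once the deadline passes
    sleep interval
  end
end

# Stand-in for a page whose table only shows up on the third check.
attempts = 0
table = wait_until(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? "TOP_QUERIES_0_0table" : nil
end
```

With Capybara itself, the closest equivalent is probably a longer wait on the finder, for example `crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']", wait: 10)`, assuming the installed version accepts the `:wait` option. Waiting only helps if the page's JavaScript actually runs to completion.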

require 'capybara/poltergeist'
require 'capybara/dsl'
require 'csv'


class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        :js_errors => false,
        :inspector => false,
        phantomjs_logger: open('/dev/null')
      })
    end
    Capybara.default_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name="screenshot")
    page.driver.render("public/#{name}.jpg",full: true)
  end

  # find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end

crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url

crawler.screenshot

crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")

Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
    from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in block in find'
    from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in synchronize'
    from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in find'
    from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in block (2 levels) in ''
    from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in block (2 levels) in <module:DSL>'
    from (irb):45
    from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in'

1 answer:

Answer 0 (score: 1)

The JavaScript errors were caused by a bad User-Agent. Once I changed the user agent to the one my Chrome browser sends, it worked!

"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"
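As a rough sketch of how this fix might slot into the question's crawler, the header hash can be built from a configurable User-Agent string. The `CHROME_UA` constant and `crawler_headers` helper are hypothetical names introduced here for illustration; only the `page.driver.headers =` assignment comes from the original code.

```ruby
# User-Agent string taken from the answer above.
CHROME_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) " \
            "AppleWebKit/537.36 (KHTML, like Gecko) " \
            "Chrome/39.0.2171.71 Safari/537.36"

# Hypothetical helper: returns the request headers used by the crawler,
# with the User-Agent supplied by the caller.
def crawler_headers(user_agent)
  { "DNT" => 1, "User-Agent" => user_agent }
end

headers = crawler_headers(CHROME_UA)

# In PoltergeistCrawler#initialize, this hash would replace the literal one:
#   page.driver.headers = headers
```

Keeping the User-Agent in one place makes it easy to swap if the site starts rejecting the current one.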