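I want to fetch the top trending queries for a particular category on Google Trends. I could download the CSV for that category, but that is not a viable solution because I want to branch into each query and find the trending sub-queries for each one.
I'm unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also, for some strange reason, taking a screenshot with Capybara returns a dark image.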
<div id="TOP_QUERIES_0_0table" class="trends-table">
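Please run the code below in a Ruby console to see it in action. Capturing elements and taking screenshots works for facebook.com or google.com, but not for Trends.
My guess is that this has something to do with the table being generated dynamically on page load, but I'm not sure whether that should stop Capybara from capturing elements that have already loaded on the page. Any tips would be very valuable.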
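One way to test the dynamic-loading hypothesis is to poll for the element for longer than the 3-second default wait. The following is only a minimal sketch: `wait_until` is a hypothetical helper written for this illustration, not part of Capybara or Poltergeist, and the usage shown in the trailing comment assumes the `crawler` object defined in the code below.

```ruby
# Hypothetical helper: retries the block until it returns a truthy value
# or the timeout (in seconds) elapses. Returns the block's value, or nil
# if the timeout is reached first.
def wait_until(timeout = 10, interval = 0.25)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    return nil if Time.now >= deadline
    sleep interval
  end
end

# Assumed usage with the crawler from this question:
#   table = wait_until(10) do
#     crawler.first(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
#   end
#   puts "table never appeared" if table.nil?
```

If the table still does not appear after a generous timeout, timing is probably not the cause and the page itself is likely failing to render.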
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'nokogiri'
require 'csv'

class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        :js_errors => false,
        :inspector => false,
        # open the null device for writing so PhantomJS output is discarded
        :phantomjs_logger => File.open('/dev/null', 'w')
      })
    end
    Capybara.default_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", :full => true)
  end

  # find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end

crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url
crawler.screenshot
crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in `block in find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in `synchronize'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in `find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in `block (2 levels) in ''
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in `block (2 levels) in <module:DSL>'
from (irb):45
from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in `'
Answer 0 (score: 1)
The JavaScript error was caused by the wrong User-Agent. Once I changed the user agent to the one my Chrome browser sends, it worked!
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"
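To show where that header fits, here is a minimal sketch that builds the header hash the question assigns to `page.driver.headers`. The `crawler_headers` helper and the `CHROME_USER_AGENT` constant are hypothetical names introduced only for this illustration. The User-Agent value is the one quoted in this answer, and the `DNT` entry is carried over unchanged from the question.

```ruby
# User-Agent string quoted in the answer (Chrome 39 on OS X).
CHROME_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) " \
                    "AppleWebKit/537.36 (KHTML, like Gecko) " \
                    "Chrome/39.0.2171.71 Safari/537.36"

# Hypothetical helper (not part of Capybara or Poltergeist): returns the
# request headers to hand to the driver, with an overridable User-Agent.
def crawler_headers(user_agent = CHROME_USER_AGENT)
  { "DNT" => 1, "User-Agent" => user_agent }
end

# Inside PoltergeistCrawler#initialize, the assignment would then read:
#   page.driver.headers = crawler_headers
```

Keeping the string in one place makes it easy to swap in the User-Agent of whichever browser is known to render the page correctly.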