Question

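I want to scrape the "Did you mean" (spell-check) part of some Google search pages. For example, if I search for "cardiovascular diesese", it links to "https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=cardiovascular%20diesese". I want to scrape the "Search instead for cardiovascular diesese" part. How can I do this with Nokogiri and XPath?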

Answer 1

If you can use the non-JavaScript URL, this should work:

require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare Kernel#open no longer accepts URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open("https://www.google.com/search?q=cardiovascular+diesese"))
doc.xpath("string(//span[@class='spell_orig']/a)") # => "cardiovascular diesese"
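
The non-JavaScript URL can be derived from the JavaScript-style one by moving the query out of the "#q=..." fragment. Below is a minimal sketch using only Ruby's standard library; the helper name non_js_search_url is hypothetical and not part of Nokogiri or open-uri.

```ruby
require 'uri'

# Hypothetical helper: turn a JavaScript-style Google URL (query stored in
# the "#q=..." fragment) into a plain /search?q=... URL.
def non_js_search_url(js_url)
  fragment = URI.parse(js_url).fragment.to_s
  # The fragment is form-encoded, e.g. "q=cardiovascular%20diesese".
  query = URI.decode_www_form(fragment).to_h["q"]
  raise ArgumentError, "no q= parameter in fragment" if query.nil?

  "https://www.google.com/search?" + URI.encode_www_form(q: query)
end

puts non_js_search_url("https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=cardiovascular%20diesese")
```

The resulting URL can then be fetched and parsed the same way as in the snippet above.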

If you can render JavaScript and need to use the original example URL, this XPath should work once the rendered document has been loaded into Nokogiri (tested in Chrome with $x):

doc.xpath("//a[@class='spell_orig'][boolean(@href)]/text()") # => "cardiovascular diesese"

Answer 2

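Since you only want to extract a single result, you can use the at_xpath shortcut, which under the hood still does xpath/css.first. To locate an element via dev tools, go to the Elements tab -> right-click the element -> Copy -> Copy XPath.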

Grab the text:

doc.at_xpath("//*[@id='fprs']/a[2]/text()")  #=> cardiovascular disease

Grab the link:

doc.at_xpath("//*[@id='fprs']/a[2]/@href")  #=> /search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg

If you want to use CSS selectors, the equivalent is:

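doc.at_css("a.spell_orig")["href"]     #=> /search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
# or
doc.css("a.spell_orig").first["href"]  #=> /search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg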

Code and example in the online IDE:

require 'nokogiri'
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "cardiovascular diesese"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

search_instead_xpath = "https://www.google.com#{doc.at_xpath("//*[@id='fprs']/a[2]/@href")}"

search_instead_css_1 = doc.at_css("a.spell_orig")["href"]
# or 
search_instead_css_2 = doc.css("a.spell_orig").first["href"]

puts search_instead_xpath, search_instead_css_1, search_instead_css_2

-------

=begin
https://www.google.com/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
=end
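
The href values above are relative. As an alternative to string interpolation, the following sketch resolves one against the Google base URL with URI.join from the standard library and reads the query term back out. The variable names are illustrative.

```ruby
require 'uri'

# Relative href as printed in the output above.
href = "/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg"

# URI.join resolves the href against the base, much like a browser would.
absolute = URI.join("https://www.google.com", href)
puts absolute

# The query string is form-encoded, so "+" decodes to a space.
params = URI.decode_www_form(absolute.query).to_h
puts params["q"]
```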

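Alternatively, you can use the Google Organic Results API from SerpApi. It is a paid API with a free plan, and it supports different languages. The difference is that you no longer have to figure out how to extract particular elements from the page. All that needs to be done is to iterate over structured JSON.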

Code to integrate:

require 'google_search_results' 


params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "cardiovascular diesese",
  hl: "en"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

search_instead_for = hash_results[:search_information][:spelling_fix]
puts search_instead_for

-------
#=> cardiovascular disease
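
A query without a typo presumably has no spelling fix, so it may be worth reading the key defensively. The sketch below uses a made-up sample payload; only the search_information -> spelling_fix path is taken from the code above, and the real response contains many more fields.

```ruby
require 'json'

# Made-up sample payload for illustration; only the
# search_information -> spelling_fix path comes from the answer.
sample = <<~PAYLOAD
  {
    "search_information": {
      "spelling_fix": "cardiovascular disease"
    }
  }
PAYLOAD

results = JSON.parse(sample, symbolize_names: true)

# dig returns nil instead of raising when a key along the path is absent.
fix = results.dig(:search_information, :spelling_fix)
puts fix || "no spelling suggestion"

# Same lookup on a payload that has no spelling_fix key.
no_fix = JSON.parse('{"search_information": {}}', symbolize_names: true)
puts no_fix.dig(:search_information, :spelling_fix).inspect
```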

Disclaimer: I work for SerpApi.

Scraping Google search with Nokogiri
