Trying to use XPath, Nokogiri, and Mechanize for web scraping

Date: 2014-07-31 20:15:08

Tags: ruby-on-rails xpath web-scraping nokogiri

I have been trying to parse some information from the saferweb site and am having trouble getting it to work.

If I can get the first value, I can adapt the code to get the rest...

For this example, the code should return Carrier, the value shown next to Entity Type.

Source:

http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=MC_MX&query_string=733709

Mechanize w/ Hpricot

  require 'rubygems'
  require 'mechanize'
  require 'hpricot'
  agent = Mechanize.new
  page = agent.get('http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=MC_MX&query_string=733709')
  @response = page.content
  doc = Hpricot(@response)
  a = (doc/"/html/body/p/table/tbody/tr[2]/td/table/tbody/tr[2]/td/center[1]/table/tbody/tr[2]/td")[0].innerHTML
  a

Enter Nokogiri

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=MC_MX&query_string=733709"))
ebit = doc.at("/html/body/p/table/tbody/tr[2]/td/table/tbody/tr[2]/td/center[1]/table/tbody/tr[2]/td").text
puts ebit

1 Answer:

Answer 0: (score: 2)

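It looks like the value columns all share the same CSS class, so it is probably easier to search by that class than by a full XPath. This worked for me.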

require 'nokogiri'
require 'open-uri'

# URI.open is used because Kernel#open no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open("http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=MC_MX&query_string=733709"))
# Get Entity Type field
ebit = doc.at('.queryfield').text
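# Get rid of non-breaking spaces and surrounding whitespace.
# Non-bang methods are used because gsub! and strip! return nil when nothing changes.
ebit = ebit.gsub("\u00A0", "").strip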
puts ebit
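
One thing to watch when cleaning up scraped text: Ruby's bang methods (gsub!, strip!) return nil when they change nothing, so chaining them can raise NoMethodError on a cell that is already clean. Below is a minimal sketch of a cleanup helper that avoids this. The name clean_cell and the sample strings are made up for illustration and are not taken from the page. It swaps each non-breaking space for a plain space before stripping, which is a design choice meant to keep words separated.

```ruby
# Hypothetical helper (not part of Nokogiri): normalize text taken from a table cell.
NBSP = "\u00A0"

def clean_cell(text)
  # gsub and strip always return a String, so this chain is safe even
  # when there is nothing to replace or nothing to strip.
  text.gsub(NBSP, " ").strip
end

# gsub! returns nil when it makes no substitution, so calling
# another method on its result would fail here.
p "CARRIER".dup.gsub!(NBSP, "")        # => nil

# Made-up sample values resembling scraped cells.
p clean_cell("CARRIER\u00A0")          # => "CARRIER"
p clean_cell("\u00A0 CARRIER \u00A0")  # => "CARRIER"
p clean_cell("CARRIER")                # => "CARRIER"
```

With a helper like this, something like clean_cell(doc.at('.queryfield').text) should behave the same whether or not the cell contains non-breaking spaces.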