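Web page scraped with Nokogiri returns no data

Time: 2017-06-26 11:46:03

Tags: ruby web-scraping nokogiri frames

I'm trying to scrape the list of projects from the UK government's UK Oil Portal, but my code isn't returning any data. What I want is an array of the project titles.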

class Entry
  def initialize(title)
    @title = title
  end
  attr_reader :title
end

def index
  @projects=Project.all
  require 'open-uri'
  require 'nokogiri'
  doc = Nokogiri::HTML(open("https://itportal.decc.gov.uk/pathfinder/currentprojectsindex.html"))

  entries = doc.css('.operator-container')
  @entries = []
  entries.each do |row|
    title = row.css('.setoutForm').text
    @entries << Entry.new(title)
  end
end

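1 answer:

Answer 0 (score: 3)

The link you posted doesn't contain any data. The page you see is a frameset, and each frame is loaded from its own URL. You want to parse the left frame, so edit your code to open the left frame's URL: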

  doc = Nokogiri::HTML(open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))

The individual projects are on separate pages, and you need to open each one. For example, the first is:

project_file = open(entries.first.css('a').attribute('href').value)       
project_doc = Nokogiri::HTML(project_file)

The "setoutForm" selector pulls in a lot of text. For example:

> project_doc.css('.setoutForm').text
=> "\n            \n              Field Type\n              Location\n              Water De
pth (m)\n              First Production\n              Contact\n            \n            \n
              Oil\n              2/15\n              155m\n              Q3/2018\n          
    \n                John Gill\n                Business Development Manager\n             
   jgill@alphapetroleum.com\n                01483 307204\n              \n            \n   
       \n            \n              Project Summary\n            \n            \n          
    \n                The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n 
               \n                Reserves are approximately 46mmbbls oil.\n                \
n                A Field Development Plan has been submitted and technically approved. The c
oncept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading.
\n                \n              \n            \n          "   

But the title isn't in that text. If you want the title, grab this part of the page:

<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>

You can do that with this CSS selector:

> project_doc.css('.operator-container .field-header').text
=> "Cheviot"

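Write this code one step at a time. Unless you step through it, it's hard to find where the code goes wrong. For example, I used Nokogiri's command line tool to open an interactive Ruby shell, with: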

nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index