How to parse a page's HTML content using Nokogiri

Date: 2016-11-09 13:44:43

Tags: ruby xpath nokogiri

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(open url)

I'm trying to get a basic set of information, such as:

event_name
categories
sponsor
venue
event_location
cost

For example, for event_name I have this XPath:

"/html/body/div[2]/div[2]/div[1]/h3/a/span"

and I use it like this:

puts doc.xpath "/html/body/div[2]/div[2]/div[1]/h3/a/span"

This returns nil for event_name.

If I save the URL's content locally, the XPath works.

Besides that, I also need the other information listed above. I checked the other XPaths as well, but the results came back blank.

2 answers:

Answer 0 (score: 2)

Here's how I'd do it:

require 'nokogiri'
doc = Nokogiri::XML(open('/Users/gferguson/smithsonian-events.xml'))
namespaces = doc.collect_namespaces

entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map{ |p| 
    entry.at('gd|when', namespaces)[p]
  }
  entry_notes = entry.at('gc|notes', namespaces).text

  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }

}

When run, this leaves entries as an array of hashes:

require 'awesome_print'
ap entries[0, 3]

# >> [
# >>   [0] {
# >>     :title      => "Conservation Clinics",
# >>     :start_time => "2016-11-09T14:00:00Z",
# >>     :end_time   => "2016-11-09T17:00:00Z",
# >>     :notes      => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail DWRCLunder@si.edu and specify CLINIC in the subject line."
# >>   },
# >>   [1] {
# >>     :title      => "Castle Highlights Tour",
# >>     :start_time => "2016-11-09T14:00:00Z",
# >>     :end_time   => "2016-11-09T14:45:00Z",
# >>     :notes      => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >>   },
# >>   [2] {
# >>     :title      => "Exhibition Interpreters/Navigators (throughout the day)",
# >>     :start_time => "2016-11-09T15:00:00Z",
# >>     :end_time   => "2016-11-09T15:00:00Z",
# >>     :notes      => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >>   }
# >> ]

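I didn't try to gather the specific information you asked for, because event_name doesn't exist in the feed, and what you're doing is very generic and easy to do once you understand a few rules.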

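XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary, but there is repetition you can use to help you. In this code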

doc.search('entry')

loops over the <entry> nodes. It is then easy to look inside each one to find the information you need.

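XML uses namespaces to help avoid tag-name collisions. They seem really hard at first, but Nokogiri provides the collect_namespaces method for documents, which returns a hash of all namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.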

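Nokogiri allows us to use both XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format that tells Nokogiri to use a namespaced tag in a CSS selector. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.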

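If you're familiar with Nokogiri, you'll see that the code above is very similar to the usual code for pulling the contents of <td> cells inside <tr> rows in an HTML <table>.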

You should be able to modify that code to gather the data you need without risking namespace collisions.

Answer 1 (score: 1)

The link you provided contains XML, so your XPath expressions need to match the XML structure rather than an HTML one.

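The key point is that the document has namespaces. As far as I understand, every XPath expression has to take that into account and specify the namespaces as well. To keep the XPath expressions simple, you can use the remove_namespaces! method: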

require 'nokogiri'
require 'open-uri'
url = 'https://www.trumba.com/calendars/smithsonian-events.xml'
doc = Nokogiri::XML(URI.open(url)); nil # URI.open is needed on Ruby 3.0+; nil is used to avoid huge output

doc.remove_namespaces!; nil
event = doc.xpath('//feed/entry[1]') # it will give you the first event

event.xpath('./title').text # => "Conservation Clinics"
event.xpath('./categories').text # => "Demonstrations,Lectures & Discussions"

Most likely you'll want an array of hashes covering all events. You can build it like this:

doc.xpath('//feed/entry').reduce([]) do |memo, event|
  event_hash = {
    title: event.xpath('./title').text,
    categories: event.xpath('./categories').text
    # all other attributes you need ...
  }
  memo << event_hash
end

It will give you an array like this:

[
  {:title=>"Conservation Clinics", :categories=>"Demonstrations,Lectures & Discussions"}, 
  {:title=>"Castle Highlights Tour", :categories=>"Gallery Talks & Tours"}, 
  ...
]