Parsing an XML feed with Nokogiri doesn't work

Asked: 2013-01-21 11:31:49

Tags: ruby nokogiri

Here is my code:

doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.blank?
  search.each do |data|
    title=data.css("title").text

    link=data.css("link").text
  end
end

But I am not getting the links.

2 Answers:

Answer 0: (score: 0)

According to http://nokogiri.org/tutorials/searching_a_xml_html_document.html, something like:

require 'nokogiri'

@doc = Nokogiri::XML(File.read("feed.xml"))
@doc.xpath('//xmlns:link')

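should do the job. Note, however, that the XML snippet you provided is not valid XML at all (there is no root element, the item tag is closed but never opened, and so on). The code above assumes the XML feed looks like this: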

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <item>
    <title>Atom-Powered Robots Run Amok</title>
    <link>http://example.org/2003/12/13/atom03</link>
  </item>
</feed>

It extracts:

<link>http://example.org/2003/12/13/atom03</link>

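as the result. If you run into a problem like this, try the documentation and reference material first. If you have tried something and it does not work the way you expect, then ask on stackoverflow with your actual code; that makes it much easier to understand your problem and to help.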

Answer 1: (score: 0)

Several things are wrong:

if !search.blank?

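won't work, because search is the NodeSet returned by doc.css, and NodeSet has no blank? method of its own (blank? is only available when ActiveSupport is loaded, for example inside Rails). Perhaps you meant empty?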

title=data.css("title").text

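isn't the right way to find the title because, as with the problem above, css returns a NodeSet rather than a Node. Calling text on a NodeSet concatenates the text of every matching node, which can return a lot of content you don't want. Instead do: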

title=data.at("title").text

Changing the code to:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
  search.each do |data|
    title = data.at("title").text
    link = data.at("link").text
    puts "title: #{ title } link: #{ link }"
  end
end

Outputs:

title: Ex-Bengals cheerleaders lawsuit trial to begin link:
title: Freedom Center Offering Free Admission Monday link:
title: Miami University Band Performing in the Inaugural Parade link:
title: Northern Kentucky Man To Present Colors At Inauguration link:
title: John Gumms Monday Forecast link:
title: President Obama VP Biden sworn in officially begin second terms link:
title: Colerain Township Pizza Hut Robbed Saturday Night link:
title: Cold Snap Coming to Tri-State link:
title: 2 Men Arrested After Police Chase in Northern Kentucky link:

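link doesn't work because the XML is malformed, which in my experience is incredibly common on the internet because people don't take the time to check their work. Note also that the document is being parsed with Nokogiri::HTML, and an HTML parser treats link as an empty element, so the URL text ends up in a sibling node rather than inside link.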

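The fix is either to massage the XML before Nokogiri receives the content, or to change how you access the nodes. Fortunately this particular XML is easy to work around, so this should help: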

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
  search.each do |data|
    title = data.at("title").text
    link = data.at("link").next_sibling.text
    puts "title: #{ title } link: #{ link }"
  end
end

Which outputs:

title: Ex-Bengals cheerleaders lawsuit trial to begin link: http://www.cincinnatisun.com/index.php/sid/212072454/scat/90d24f4ad98a2793
title: Freedom Center Offering Free Admission Monday link: http://www.cincinnatisun.com/index.php/sid/212072914/scat/90d24f4ad98a2793
title: Miami University Band Performing in the Inaugural Parade link: http://www.cincinnatisun.com/index.php/sid/212072915/scat/90d24f4ad98a2793
title: Northern Kentucky Man To Present Colors At Inauguration link: http://www.cincinnatisun.com/index.php/sid/212072913/scat/90d24f4ad98a2793
title: John Gumms Monday Forecast link: http://www.cincinnatisun.com/index.php/sid/212070535/scat/90d24f4ad98a2793
title: President Obama VP Biden sworn in officially begin second terms link: http://www.cincinnatisun.com/index.php/sid/212060033/scat/90d24f4ad98a2793
title: Colerain Township Pizza Hut Robbed Saturday Night link: http://www.cincinnatisun.com/index.php/sid/212057132/scat/90d24f4ad98a2793
title: Cold Snap Coming to Tri-State link: http://www.cincinnatisun.com/index.php/sid/212057131/scat/90d24f4ad98a2793
title: 2 Men Arrested After Police Chase in Northern Kentucky link: http://www.cincinnatisun.com/index.php/sid/212057130/scat/90d24f4ad98a2793

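Having done all that, you can write the code more clearly, like this (each does nothing on an empty NodeSet, so the emptiness check can go):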

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
doc.css('item').each do |data|
  title = data.at("title").text
  link = data.at("link").next_sibling.text
  puts "title: #{ title } link: #{ link }"
end

Interestingly, the sample page now appears to have fixed its links.