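How to get an RSS feed in XML format for a Ruby script

Time: 2013-12-17 19:43:43

Tags: ruby xml dashing

I am using the following Ruby script from this Dashing widget. It retrieves an RSS feed, parses it, and sends the parsed titles and descriptions to the widget.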

require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'

news_feeds = {
  "seattle-times" => "http://seattletimes.com/rss/home.xml",
}

Decoder = HTMLEntities.new

class News
  def initialize(widget_id, feed)
    @widget_id = widget_id
    # pick apart feed into domain and path
    uri = URI.parse(feed)
    @path = uri.path
    @http = Net::HTTP.new(uri.host)
  end

  def widget_id()
    @widget_id
  end

  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)
    news_headlines = []
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html( news_item.xpath('title').text )
      summary = clean_html( news_item.xpath('description').text )
      news_headlines.push({ title: title, description: summary })
    end
    news_headlines
  end

  def clean_html( html )
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode( html )
    return html
  end

end

@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue StandardError => e
    puts e.to_s
  end
end

SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end

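The example RSS feed works because its URL points to an XML file. However, I want to use this script with another RSS feed that does not serve an actual XML file. The feed I want is http://www.ttc.ca/RSS/Service_Alerts/index.rss, and it does not display anything on the widget. Besides "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss", but had no luck. Does anyone know how I can get the actual XML data behind this RSS feed so that I can use it with this Ruby script?

1 answer:

Answer 0: (score: 2)

You're right that the link does not serve regular XML, so the script fails when parsing it, because it was written specifically to parse the example XML. The RSS feed you are trying to parse serves RDF XML, which you can parse with the RDFXML RubyGem (rdf-rdfxml).

Something like: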

require 'nokogiri'
require 'rdf/rdfxml'

rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'

RDF::RDFXML::Reader.open(rss_feed) do |reader|
  # use reader to iterate over elements within the document
end

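From here, you can try to learn how to use RDFXML to extract what you want. I would start by checking which methods are available on the reader object: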

puts reader.methods.sort - Object.methods

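This prints the reader's own methods. Look for ones that might serve your purpose, such as reader.each_entry.

To dig further, you can inspect what each entry looks like: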
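
As a self-contained illustration of that introspection trick, here is a minimal sketch that runs without any gems. FeedReader, own_methods and class_only_methods are made-up names standing in for the real reader, not part of RDFXML. The sketch subtracts Object.instance_methods as the baseline, since that lists the methods every ordinary object responds to.

```ruby
# Hypothetical stand-in for a feed reader; not part of any gem.
class FeedReader
  include Enumerable

  def initialize(entries)
    @entries = entries
  end

  # Enumerable builds each_entry, map, select, ... on top of each.
  def each(&block)
    @entries.each(&block)
  end
end

# Methods an object responds to beyond those every Object instance has.
def own_methods(obj)
  obj.methods.sort - Object.instance_methods
end

# Methods defined directly in the object's class (no ancestors, no mixins).
def class_only_methods(obj)
  obj.class.instance_methods(false).sort
end

if __FILE__ == $PROGRAM_NAME
  reader = FeedReader.new(['first entry', 'second entry'])
  puts own_methods(reader)                 # includes each, each_entry, map, ...
  puts class_only_methods(reader).inspect  # only what FeedReader itself defines
end
```

The second helper is handy when the first list is long: it narrows things down to what the class itself defines, leaving out everything mixed in from modules such as Enumerable.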

reader.each_entry do |entry|
  puts "----here's an entry----" 
  puts entry.inspect
end

Or see which methods you can call on an entry:

reader.each_entry do |entry|
  puts "----here's an entry's methods----" 
  puts entry.methods.sort - Object.methods
  break
end

Using this hack, I was able to roughly pull out some titles and descriptions:

RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
  reader.each_object do |object|
    puts object.to_s if object.is_a? RDF::Literal
  end
end

# returns:

# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp

#      TTC Service Alerts.

# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory

# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)

# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00

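But I couldn't quickly find a way to tell which one is the title and which is the description :/

Finally, if you still can't work out how to extract what you need, start a new question with this information.

Good luck!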
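
One possible way to keep each title paired with its description is to skip the RDF triples and query the XML directly with namespace-aware XPath. The sketch below is only that, a sketch: it assumes the feed follows the RSS 1.0 (RDF) layout, where item elements sit next to channel (not inside it) and live in the http://purl.org/rss/1.0/ namespace. SAMPLE_FEED is a made-up document, not the real TTC response, and headlines_from is a hypothetical helper. It uses REXML from Ruby's standard distribution rather than Nokogiri so that it runs without extra gems.

```ruby
require 'rexml/document'

# Prefix-to-URI mapping for RSS 1.0 elements (channel, item, title, ...).
RSS1_NS = { 'rss' => 'http://purl.org/rss/1.0/' }

# Hypothetical sample shaped like an RSS 1.0 feed; NOT the real TTC data.
SAMPLE_FEED = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.com/alerts">
      <title>Example Service Alerts</title>
      <link>http://example.com/alerts</link>
      <description>Example channel.</description>
    </channel>
    <item rdf:about="http://example.com/alerts#1">
      <title>Service Advisory</title>
      <link>http://example.com/alerts#1</link>
      <description>Route 1 diverting due to a collision.</description>
    </item>
    <item rdf:about="http://example.com/alerts#2">
      <title>Service Advisory (2)</title>
      <link>http://example.com/alerts#2</link>
      <description>Route 2 diverting due to road work.</description>
    </item>
  </rdf:RDF>
XML

# Returns an array of { title:, description: } hashes, one per item.
def headlines_from(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//rss:item', RSS1_NS).map do |item|
    title = REXML::XPath.first(item, 'rss:title', RSS1_NS)
    summary = REXML::XPath.first(item, 'rss:description', RSS1_NS)
    # A missing element yields nil, so fall back to an empty string.
    { title: title ? title.text.to_s.strip : '',
      description: summary ? summary.text.to_s.strip : '' }
  end
end

if __FILE__ == $PROGRAM_NAME
  headlines_from(SAMPLE_FEED).each do |headline|
    puts "#{headline[:title]}: #{headline[:description]}"
  end
end
```

With Nokogiri the same idea should carry over by passing the prefix mapping as the second argument to xpath, for example doc.xpath('//rss:item', 'rss' => 'http://purl.org/rss/1.0/'). Whether the real feed matches this layout is something to confirm by looking at the raw response first.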