Question

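How can I find the exact feed.xml / rss.xml / atom.xml path of a website?

For example, I supply "http://www.example.com/news/today/this_is_a_news", but the RSS feed lives at "http://www.example.com/rss/feed.xml". Most modern browsers already detect this, and I am curious how they do it.

Can you give sample code in Ruby, Python, or Bash?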

Answer 1

Something like this in Ruby would work...

require 'nokogiri'
require 'open-uri'

# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri.
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website'))
# Print the href of the first Atom autodiscovery link on the page.
puts html.css('link[type="application/atom+xml"]').first.attr('href')
#  => "/feeds/question/2441954"

Note that this is an absolute path, not a full URL. That is legal, so you need to prepend the scheme and host yourself.
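As a minimal sketch of that step, Python's standard `urllib.parse.urljoin` can resolve the href against the URL of the page it was found on. The variable names here are illustrative only, and the sketch assumes the page does not override its base URL.

```python
from urllib.parse import urljoin

# The page that was fetched, and the href found in its <link> element.
page_url = "http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website"
href = "/feeds/question/2441954"

# urljoin resolves the href against the page URL, whether the href is
# host-relative, document-relative, or already a full URL.
feed_url = urljoin(page_url, href)
print(feed_url)
```

An href that is already a full URL passes through unchanged, so the same call covers both cases.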

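Also, "application/atom+xml" could instead be "application/rss+xml" or "application/rdf+xml", and a page can contain more than one such link, so you need to decide how to handle multiples. According to the autodiscovery documentation, the first one offered should be the preferred one, but in my experience pages do not always follow that. The documentation also says the links should not be alternate data types for the same content (RSS and Atom pointing to the same content) but should point to different content; again, I have seen pages that break this rule.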
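One way to handle several feed links is to collect them all in document order and then apply whatever policy you choose. The following is a rough sketch using only the Python standard library; `FeedLinkParser`, `find_feeds`, and the sample HTML are hypothetical names and data made up for illustration, and "take the first link" is just one possible policy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types mentioned above for feed autodiscovery.
FEED_TYPES = ("application/atom+xml", "application/rss+xml", "application/rdf+xml")


class FeedLinkParser(HTMLParser):
    """Collects (type, href) for every <link> whose type is a feed type."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        link_type = (attr.get("type") or "").strip().lower()
        href = attr.get("href")
        if link_type in FEED_TYPES and href:
            self.feeds.append((link_type, href))


def find_feeds(html, page_url):
    """Return feed links in document order, with hrefs resolved to full URLs."""
    parser = FeedLinkParser()
    parser.feed(html)
    return [(t, urljoin(page_url, h)) for t, h in parser.feeds]


# Hypothetical page with two feed links, used only for illustration.
sample_html = """
<html><head>
<link rel="alternate" type="application/rss+xml" href="/rss/feed.xml">
<link rel="alternate" type="application/atom+xml" href="atom.xml">
</head><body></body></html>
"""
feeds = find_feeds(sample_html, "http://www.example.com/news/today/this_is_a_news")
first = feeds[0] if feeds else None  # policy assumed here: prefer the first link offered
print(feeds)
print(first)
```

Keeping the full list around lets the caller switch policy later, for example preferring Atom over RSS when both are present.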

Answer 2

You can also use command-line tools such as xmlstarlet (together with HTML Tidy):

# version 1
curl -s http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website | 
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -T -t -m "//*[local-name()='link']" --if "@type='application/atom+xml' or @type='application/rss+xml'" -m "@href" -v '.' -n

# version 2
curl -s http://stackoverflow.com/questions/2441954/how-to-find-out-the-exact-rss-xml-path-of-a-website | 
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:link[@type='application/atom+xml' or @type='application/rss+xml']" -v "@href" -n

Answer 3

In Python, use this classic solution: http://www.aaronsw.com/2002/feedfinder/
