Parsing an XML RSS feed byte stream for <item> tags

Time: 2013-02-07 20:20:12

Tags: python xml parsing rss beautifulsoup

I am trying to parse an RSS feed for the first instance of the <item> element.

import urllib2

def pageReader(url):
    try:
        readPage = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        # HTTPError is a subclass of URLError, so it must be caught first
        # print('The server couldn\'t fulfill the request.')
        # print('Error code: ', e.code)
        return 404
    except urllib2.URLError, e:
        # print 'We failed to reach a server.'
        # print 'Reason: ', e.reason
        return 404
    else:
        outputPage = readPage.read()
        return outputPage

Assume the argument passed in is valid. The function returns a str object whose value is simply a complete RSS feed - I have confirmed the type:

a = isinstance(value, str)
if not a:
   return -1

So a complete RSS feed is returned from the function call, and this is where I hit a brick wall - I have tried parsing the feed with BeautifulSoup, lxml, and various other libraries with no success (I had some success with BeautifulSoup, but it could not extract certain child elements from the parent <item>, e.g. <pubDate> and <link>). I am just about ready to write my own parser, but I wanted to ask whether anyone has any suggestions.
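For context, a minimal sketch of the kind of extraction being attempted, written against lxml.etree, might look like the following (the helper name firstItem is illustrative, not part of the original code, and it assumes the feed's <item> elements are not namespaced):

from lxml import etree

def firstItem(feedBytes):
    # Parse the raw feed string returned by pageReader as XML
    root = etree.fromstring(feedBytes)
    # Locate the first <item> element anywhere in the document
    item = root.find('.//item')
    if item is None:
        return None
    # Pull out the child elements of interest as plain text
    return {
        'title': item.findtext('title'),
        'link': item.findtext('link'),
        'description': item.findtext('description'),
        'pubDate': item.findtext('pubDate'),
    }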

To reproduce my error, simply call the above function with an argument like:

http://www.cert.org/nav/cert_announcements.rss

and you will see that I am struggling to get the first child returned:

<item>
<title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title>
<link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link>
<description>This sixteenth of 19 blog posts about the fourth edition of the Common   Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description>
<pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate>
</item>

As I said, BeautifulSoup fails to find pubDate and link, which are critical to my application.

Any suggestions are greatly appreciated.

1 answer:

Answer 0: (score: 1)

I had some success using BeautifulStoneSoup and querying the tags in lowercase:

from BeautifulSoup import BeautifulStoneSoup

xml = '<item><title>New Blog Entry: Common Sense Guide to Mitigating Insider Threats - Best Practice 16 (of 19)</title><link>http://www.cert.org/blogs/insider_threat/2013/02/common_sense_guide_to_mitigating_insider_threats_-_best_practice_16_of_19.html</link><description>This sixteenth of 19 blog posts about the fourth edition of the Common   Sense Guide to Mitigating Insider Threats describes Practice 16: Develop a formalized insider threat program.</description><pubDate>Wed, 06 Feb 2013 06:38:07 -0500</pubDate></item>'

# BeautifulStoneSoup parses the input as XML (so <link> keeps its contents)
# and lowercases tag names, so pubDate must be queried as 'pubdate'
soup = BeautifulStoneSoup(xml)
item = soup('item')[0]
print item('pubdate'), item('link')
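Putting the pieces together, a rough end-to-end sketch (assuming the pageReader function and the cert.org feed URL from the question) could look like this:

from BeautifulSoup import BeautifulStoneSoup

# Fetch the live feed with the question's pageReader; it returns 404 on failure
feed = pageReader('http://www.cert.org/nav/cert_announcements.rss')
if isinstance(feed, str):
    soup = BeautifulStoneSoup(feed)
    first = soup('item')[0]          # first <item> in the feed
    # tag names are lowercased by BeautifulStoneSoup, hence 'pubdate'
    print first('title')[0].string
    print first('link')[0].string
    print first('pubdate')[0].string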