I am trying to parse the XML feed from Baidu (GB2312-encoded) at http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss
but I always get this error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3, column 8
If I switch the URL to the Google feed http://news.google.com/news?cf=all&ned=us&hl=en&topic=b&output=rss, it works fine. Any suggestions?
import xml.etree.ElementTree as etree
from urllib import urlopen  # Python 2; in Python 3 this lives in urllib.request

def get_feeds():
    URL = "http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss"
    #URL = "http://news.google.com/news?cf=all&ned=us&hl=en&topic=b&output=rss"
    # Raises ExpatError for the Baidu feed, works for the Google feed
    tree = etree.parse(urlopen(URL))
    return tree

if __name__ == '__main__':
    get_feeds()
Answer 0 (score: 0)
Use the excellent feedparser library, which parses that URL without any problem:
>>> import feedparser
>>> feed = feedparser.parse('http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss')
>>> print feed['feed']['title']
百度国内焦点新闻
>>> len(feed['entries'])
20
>>> print feed['entries'][0]['title']
强台风“天兔”正逐渐接近台湾陆地
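
If adding a dependency is not an option, a standard-library workaround may also be possible. The sketch below assumes the error comes from the feed's GB2312 encoding, which the expat parser behind ElementTree may not handle; that cause is not confirmed in the question. The helper name `parse_gb_feed` and the sample document are hypothetical, and the code is written for Python 3 rather than the Python 2 used above. It decodes the bytes itself, drops the XML declaration, and hands UTF-8 to the parser:

```python
import re
import xml.etree.ElementTree as etree


def parse_gb_feed(raw_bytes, source_encoding="gb18030"):
    """Parse an XML document whose bytes use a GB-family encoding.

    Hypothetical helper: gb18030 is used as the default because it is a
    superset of GB2312, so it also decodes feeds labelled gb2312.
    """
    # Decode in Python instead of letting expat interpret the encoding.
    text = raw_bytes.decode(source_encoding, errors="replace")
    # Remove the XML declaration so the old encoding name is not seen.
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    # Without a declaration the parser assumes UTF-8.
    return etree.fromstring(text.encode("utf-8"))


# Small stand-in for the real feed (assumed shape, not fetched from Baidu).
sample = (
    '<?xml version="1.0" encoding="gb2312"?>\n'
    '<rss version="2.0"><channel>'
    '<title>百度国内焦点新闻</title>'
    '</channel></rss>'
).encode("gb2312")

root = parse_gb_feed(sample)
title = root.find("channel/title").text
```

For the live feed, the bytes would come from `urllib.request.urlopen(URL).read()` in Python 3. Whether this fixes the original error depends on what the feed actually contains at line 3, so feedparser remains the more forgiving choice.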