Parsing a web page that contains XML raises an error

Time: 2014-01-03 10:50:54

Tags: python xml

I am trying to parse XML. I need the title, description, and pubdate. I get this error:

  for item in doc.findAll('rss/channel/item'):
AttributeError: 'str' object has no attribute 'findAll'

Here is my code:

from bs4 import BeautifulSoup
import csv, sys
import urllib2
from xml.dom.minidom import parse, parseString

toursxml = 'http://www.tradingeconomics.com/rss/news.aspx'
toursurl= urllib2.urlopen(toursxml)
doc= toursurl.read()
#parseString( doc )
#print doc
data = []
cols = set()
for item in doc.findAll('rss/channel/item'):
    d = {}
    for sub in item:
        if hasattr(sub, 'name'):
            d[sub.name] = sub.text
    data.append(d)
    cols = cols.union(d.keys())

cw = csv.writer(sys.stdout)
cw.writerow(cols)
for row in data:
    cw.writerow([row.get(k, 'N/A') for k in cols])

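1 answer:

Answer 0 (score: 1)

You are trying to parse an RSS feed with the wrong tools. Your code tries to use BeautifulSoup methods without actually creating a BeautifulSoup object (doc is just the string returned by read()), tries to use an XPath expression with an API that does not support XPath, and tries to use a library geared towards HTML rather than XML.

Use feedparser instead to handle feeds like this: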

import feedparser

feed = feedparser.parse('http://www.tradingeconomics.com/rss/news.aspx')

for item in feed.entries:
    print item.title, item.author

This produces:

>>> import feedparser
>>> feed = feedparser.parse('http://www.tradingeconomics.com/rss/news.aspx')
>>> for item in feed.entries:
...     print item.title, item.author
... 
Latvia Retail Sales MoM Central Statistical Bureau of Latvia
China Foreign Exchange Reserves People's Bank of China
Latvia Retail Sales YoY Central Statistical Bureau of Latvia
Spain Business Confidence Ministry of Industry, Tourism and Trade, Spain
Italy Consumer Price Index (CPI) National Institute of Statistics (ISTAT)
Italy Inflation Rate National Institute of Statistics (ISTAT)
Cyprus Inflation Rate Statistical Service of the Republic of Cyprus
# .... and many more lines
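
If adding a dependency is not an option, the same idea (parse the response into a tree first, then query it) can also be sketched with the standard library. The example below is a minimal sketch rather than a drop-in replacement: it is written for Python 3, unlike the Python 2 snippets above, and it runs on a small inline sample instead of the live feed. SAMPLE_RSS, its placeholder descriptions and date, and the helper names extract_items and to_csv are all assumptions made for illustration. xml.etree.ElementTree supports a limited subset of XPath, so a path such as channel/item works once the string has been parsed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded feed body.
# The descriptions and the date are placeholders, not real feed data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Latvia Retail Sales MoM</title>
      <description>Placeholder description one.</description>
      <pubDate>Fri, 03 Jan 2014 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>China Foreign Exchange Reserves</title>
      <description>Placeholder description two.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "description", "pubDate")


def extract_items(xml_text):
    """Parse RSS text and return one dict per <item> with the wanted fields."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    rows = []
    # The path is relative to <rss>, so it starts at <channel>.
    for item in root.findall("channel/item"):
        # findtext returns the default when the child element is absent.
        rows.append({name: item.findtext(name, default="N/A") for name in FIELDS})
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(extract_items(SAMPLE_RSS)), end="")
```

To run this against the real feed, the string passed to extract_items would be the body returned by the HTTP request. Real-world feeds are often not strictly well-formed XML, which is one reason a tolerant parser such as feedparser is usually the more robust choice.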