Question

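I am trying to extract all item elements from the XML file at https://feeds.finance.yahoo.com/rss/2.0/headline?s=goog&region=US&lang=en-US so that I can access the title and link of each one and then perform some other operations.

The XML has the following structure: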

<rss>
    <channel>
    <title> </title>
    <copyright></copyright>
    <link></link>
    <description></description>
    <language></language>
    <lastBuildDate></lastBuildDate>
    <image>
    <url></url>
    <title></title>
    <link></link>
    <width></width>
    <height></height>
    </image>
    <item>
        <title></title>
        <link></link>
        <description></description>
        <guid></guid>
        <pubDate></pubDate>
    </item>
    </channel>
</rss>

I wrote the following code:

import urllib
from xml.etree import ElementTree


class News():

    base_url = 'http://finance.yahoo.com/rss/headline?s='
    query = 'goog'

    url = base_url + query
    response = urllib.urlopen(url)
    data = response.read()

    dom = ElementTree.fromstring(data)
    items = dom.findall('channel/item/')


    for item in items:
        print item.text

It prints every element inside the <channel> element, for example:

Google funds 128 news projects in Europe
http://us.rd.yahoo.com/finance/news/rss/story/*http://sg.finance.yahoo.com/news/google-funds-128-news-projects-211927426.html
None
yahoo_finance/2067775856
Wed, 24 Feb 2016 21:19:27 GMT

However, I cannot figure out how to access the elements inside <item>. I tried the following code:

for item in items:

        title = item.find('title')
        print title.text

But I get the following error: AttributeError: 'NoneType' object has no attribute 'text'

How can I access the title and link elements inside each item element? Thanks.

Answer 1

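Removing the trailing slash in dom.findall('channel/item/'), so that it reads dom.findall('channel/item'), is enough. The sample code below prints only the titles: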

import urllib
from xml.etree import ElementTree


class News():

    base_url = 'http://finance.yahoo.com/rss/headline?s='
    query = 'goog'

    url = base_url + query
    response = urllib.urlopen(url)
    data = response.read()

    dom = ElementTree.fromstring(data)
    items = dom.findall('channel/item')


    for item in items:
        print(item.find('title').text)

The output is just the titles:

Google launches 'Accelerated Mobile Pages' feature in India
The Death of Oscar Trivia
Meet Atlas, Boston Dynamics' New Humanoid Robot
[$$] Business Watch
Google Fiber Heads To San Francisco; Faster Search Service Coming
U.S. Justice Dept., Silicon Valley discuss online extremism
Google Fiber to Expand to Tech Hub
Behind Google's Deepmind Healthcare App
Google Renews Push for ‘Fair Use’ of APIs Before Oracle Trial
Forget Keyboards: We Dictated This Story on Google Docs
U.S. aviation regulator starts rule-making process for public drone flights
Android N could stand for No App Drawer: Why that's an epic mistake
Google is putting its video streaming gadget directly inside TVs
These Google Maps glitches are the stuff of nightmares
Google launches AMP for faster web page loading
Microsoft to buy app-development startup Xamarin
Will Users Like Facebook’s New Selection of ‘Reactions?’ — Tech Roundup
France Says Google Owes 1.6 Billion Euros in Back Taxes
Google speeds news to smartphones, challenging Facebook
Google funds 128 news projects in Europe
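
To see why the trailing slash matters, here is a minimal sketch in Python 3 syntax that runs against a small inline XML string (the sample feed contents are made up for illustration, not real Yahoo data). In ElementTree, a path ending in / is treated as if it ended in /*, so 'channel/item/' selects the children of each <item> rather than the <item> elements themselves. That is why item.find('title') returned None in the question.

```python
from xml.etree import ElementTree

# Made-up sample feed for illustration; not real Yahoo data.
SAMPLE = """<rss>
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

root = ElementTree.fromstring(SAMPLE)

# Trailing slash: matches the children of each <item> (title, link, ...).
children = root.findall('channel/item/')
child_tags = [child.tag for child in children]

# No trailing slash: matches the <item> elements themselves.
items = root.findall('channel/item')
headlines = [(item.find('title').text, item.find('link').text) for item in items]

print(child_tags)
print(headlines)
```

With the trailing slash, each result is already a title or link element, so calling find('title') on it looks for a nested title that does not exist.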

Answer 2

You can try this:

from lxml import etree

root = etree.fromstring(data)
results = root.findall('channel/item')
texts = [r.find('title').text for r in results]

Answer 3

A slightly different approach using lxml / XPath:

import requests
import lxml.etree

r = requests.get('https://feeds.finance.yahoo.com/rss/2.0/headline?s=goog&region=US&lang=en-US')
tree = lxml.etree.fromstring(r.content)

items = tree.xpath('//item')

def fst(i):
    if i: return i[0]
    else: return ''

data = []
for i in items:
    entry = {
        'title'   : fst(i.xpath('title/text()')),
        'link'    : fst(i.xpath('link/text()')),
        'guid'    : fst(i.xpath('guid/text()')),
        'pubDate' : fst(i.xpath('pubDate/text()')),
        'description' : fst(i.xpath('description/text()')),
    }
    data.append(entry)

for entry in data:
    print entry['title']
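
If lxml is not available, roughly the same dictionaries can probably be built with the standard library alone. The following is a minimal sketch in Python 3 syntax, assuming the item fields carry no XML namespace; the helper name parse_items and the sample string are hypothetical and only there for illustration. findtext accepts a default value, so a helper like fst is not needed to handle missing elements.

```python
from xml.etree import ElementTree

FIELDS = ('title', 'link', 'guid', 'pubDate', 'description')


def parse_items(xml_bytes):
    """Return one dict per <item>, with '' for any missing field."""
    root = ElementTree.fromstring(xml_bytes)
    # iter('item') walks the whole tree, similar in spirit to XPath '//item'.
    return [
        {field: item.findtext(field, default='') for field in FIELDS}
        for item in root.iter('item')
    ]


# Made-up sample data; the first item deliberately lacks most fields.
sample = b"""<rss><channel>
  <item><title>Only a title</title></item>
  <item><title>Full entry</title><link>http://example.com/2</link>
        <guid>id-2</guid><pubDate>Wed, 24 Feb 2016 21:19:27 GMT</pubDate>
        <description>text</description></item>
</channel></rss>"""

entries = parse_items(sample)
for entry in entries:
    print(entry['title'])
```

For the real feed, the bytes returned by the HTTP request (r.content in the snippet above) could be passed to parse_items in place of the sample.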

Unable to extract item elements from XML
