Question

I am parsing an XML feed in Python to extract certain tags. My XML contains namespaces, so every tag name is reported as a namespace followed by the tag name.

Here is the XML:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
    <id>10038711/</id>
    <updated>2013-01-24T22:52:43+00:00</updated>
    <title type="text">Reeling in the Years</title>
    <logo>http://www.rte.ie/iptv/images/logo.gif</logo>
    <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&amp;showId=10038711" />
    <category term="feed"/>
    <author>
        <name>RTE</name>
        <uri>http://www.rte.ie</uri>
    </author>
    <entry>
        <id>10038711</id>
        <published>2012-07-04T12:00:00+01:00</published>
        <updated>2013-01-06T12:31:25+00:00</updated>
        <title type="text">Reeling in the Years</title>
        <content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
        <category term="WEB Exclusive" rte:type="channel"/>
        <category term="Classics 1980" rte:type="genre"/>
        <category term="rte player" rte:type="source"/>
        <category term="" rte:type="transmision_details"/>
        <category term="False" rte:type="copyprotectionoptout"/>
        <category term="long" rte:type="form"/>
        <category term="3275" rte:type="progid"/>
        <link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
        <link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&amp;type=iptv&amp;format=xml" />
        <link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
        <rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
        <rte:duration ms="842205" formatted="0:10"/>
        <rte:statistics views="19"/>
        <rte:bri id="na"/>
        <rte:channel id="13"/>
        <rte:item id="10038711"/>
        <media:title type="plain">Reeling in the Years</media:title>
        <media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
        <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
        <media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
        <media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
        <media:copyright>RTÉ</media:copyright>
        <media:group rte:format="single">
            <media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
        </media:group>
        <rte:ads>
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre2&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
            <media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&amp;iu=%2F3014%2FP_RTE_TV50_Pre3&amp;ciu_szs=300x250&amp;impl=s&amp;gdfp_req=1&amp;env=vp&amp;output=xml_vast2&amp;unviewed_position_start=1&amp;url=[referrer_url]&amp;correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
        </rte:ads>
    </entry>
<!-- playlist.xml -->
</feed>

When I parse the XML, each element shows up as:

{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....
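
For reference, these names are ElementTree's qualified tags: the namespace URI in braces, followed by the local name. The following is a minimal sketch, assuming the standard-library `xml.etree.ElementTree` parser; the trimmed feed and the `local_name` helper are illustrative and not part of the original post. It reproduces the output above and recovers the bare tag names:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the feed above (illustrative only).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:rte="http://www.rte.ie/schemas/vod">
  <entry>
    <id>10038711</id>
    <rte:duration ms="842205" formatted="0:10"/>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""


def local_name(tag):
    """Return the part of a '{uri}name' tag after the closing brace."""
    # rpartition yields ('', '', tag) when there is no '}', so tags
    # without a namespace pass through unchanged.
    return tag.rpartition("}")[2]


root = ET.fromstring(FEED)
entry = root[0]  # the single <entry> child of <feed>

qualified = [child.tag for child in entry]
bare = [local_name(child.tag) for child in entry]

print(qualified)
print(bare)
```

Comparing on `local_name(child.tag)` works regardless of which namespace URI the feed declares, at the cost of treating same-named tags from different namespaces as equal.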

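I have 3 different namespaces and cannot guarantee that they will always stay the same, so I would rather not spell out each tag strictly like this: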

for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
    channel_tmp = elem.find('{http://www.w3.org/2005/Atom}category', "channel")
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))

Is there any way to put a wildcard or something similar into the find so that it ignores the namespace?

stream = str(elem.find('*id').text)

I could hard-code them as above, but with my luck the namespaces will change and my queries will stop returning data.

Thanks for your help.

Answer 1

You can use an XPath expression with the local-name() function:

<?xml version="1.0"?>
<root xmlns="ns">
  <tag/>
</root>

Assuming "doc" is an ElementTree for the XML above:

import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
[<Element {ns}tag at 0x7fcde6f7c960>]

Replace <some_file_like_object> as appropriate (alternatively, you can use lxml.etree.fromstring with an XML string to get the root element directly).
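
If lxml is not available, the standard library may be enough. This is a hedged sketch, not part of the original answer: since Python 3.8, `xml.etree.ElementTree` accepts `{*}name` in `find()`, `findall()` and `findtext()` to match `name` in any namespace (or none). The trimmed feed below is an illustrative stand-in for the feed in the question, and the variable names simply mirror the asker's loop.

```python
import xml.etree.ElementTree as ET

# Illustrative excerpt of the feed from the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:rte="http://www.rte.ie/schemas/vod">
  <entry>
    <id>10038711</id>
    <title type="text">Reeling in the Years</title>
    <category term="WEB Exclusive" rte:type="channel"/>
    <category term="Classics 1980" rte:type="genre"/>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

results = []
# "{*}name" matches "name" in any namespace (or none); needs Python 3.8+.
for entry in root.findall("{*}entry"):
    stream = entry.findtext("{*}id")
    title = entry.findtext("{*}title")
    icon_url = entry.find("{*}thumbnail").get("url")
    # Attribute names are namespace-qualified as well, so compare only the
    # local part of each attribute key when looking for type="channel".
    channel = next(
        (
            cat.get("term")
            for cat in entry.findall("{*}category")
            if any(
                key.rpartition("}")[2] == "type" and value == "channel"
                for key, value in cat.attrib.items()
            )
        ),
        None,
    )
    results.append((stream, title, channel, icon_url))

print(results)
```

The wildcard belongs in the path-based methods shown here; on Python versions older than 3.8 the `local-name()` approach with lxml above remains the option to use.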
