How to get XML without the root node in Python

Time: 2011-09-14 09:10:08

Tags: python xml xml-parsing

Given the following data:

<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.
<channel rdf:about="http://www.gmanews.tv/">
        <title>GMANews.TV</title>
        <description> GMA News.tv bring you the latest news from GMA News teams and highlights of your favorite shows. Subscribe now and stay up-to-date with GMA News.tv.</description>
        <link>http://www.gmanews.tv/</link>
</channel>

<item rdf:about="http://www.gmanews.tv/story/232365/world/magnitude-59-quake-hits-chilean-coast-no-damage">
        <dc:format>text/html</dc:format>
        <dc:date>2011-09-14T16:39:22+08:00</dc:date>
        <dc:source>http://www.gmanews.tv/story/232365/world/magnitude-59-quake-hits-chilean-coast-no-damage </dc:source>
                <title><![CDATA[Magnitude-5.9 quake hits Chilean coast, no damage]]></title>
        <link>http://www.gmanews.tv/story/232365/world/magnitude-59-quake-hits-chilean-coast-no-damage </link>
        <description><![CDATA[SANTIAGO - A magnitude 5.9 quake hit just off the coast of central Chile early on Wednesday, but the state emergency office said there were no reports of damage.]]></description>
    </item>
        <item rdf:about="http://www.gmanews.tv/story/232362/nation/house-minority-blames-pnoys-advisers-for-legal-setbacks">
        <dc:format>text/html</dc:format>
        <dc:date>2011-09-14T16:04:51+08:00</dc:date>
        <dc:source>http://www.gmanews.tv/story/232362/nation/house-minority-blames-pnoys-advisers-for-legal-setbacks </dc:source>
                <title><![CDATA[House minority blames PNoy's advisers for legal 'setbacks']]></title>
        <link>http://www.gmanews.tv/story/232362/nation/house-minority-blames-pnoys-advisers-for-legal-setbacks </link>
        <description><![CDATA[Members of the opposition at the House of Representatives on Wednesday blamed President Benigno Aquino III's advisers for the various legal "setbacks&quot; suffered by his administration and advised him to consider replacing some of his advisers.]]></description>
    </item>
        <item rdf:about="http://www.gmanews.tv/story/232356/nation/ex-sharia-judge-20-others-may-testify-in-poll-fraud-probe">
        <dc:format>text/html</dc:format>
        <dc:date>2011-09-14T15:19:45+08:00</dc:date>
        <dc:source>http://www.gmanews.tv/story/232356/nation/ex-sharia-judge-20-others-may-testify-in-poll-fraud-probe </dc:source>
                <title><![CDATA[Ex-Shari'a judge, 20 others may testify in poll fraud probe]]></title>
        <link>http://www.gmanews.tv/story/232356/nation/ex-sharia-judge-20-others-may-testify-in-poll-fraud-probe </link>
        <description><![CDATA[The former Shari'a court judge who claimed to have helped Gloria Macapagal-Arroyo cheat in the 2004 presidential elections and at least 20 others may serve as witnesses in the joint investigation by the Commission on Elections and Department of Justice on the alleged poll fraud, Comelec chief Sixto Brillantes Jr. said Wednesday.]]></description>
    </item>
</rdf:RDF>

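Now I want to get the details of every element inside the <item> tags. This is probably trivial, but I am new to Python. I'm not sure how to parse the RDF and then extract all of the <item> elements inside it.

Edit: I can't use any third-party libraries, because my script will run on an embedded system.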

2 answers:

Answer 0: (score: 2)

lxml provides a nice way to handle everything XML. For the XML example you posted:

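from lxml import etree

document = etree.parse('your-example-xml.rdf')
root = document.getroot()

# Namespace shortcuts: the default (unprefixed) namespace and the rdf namespace
ns = root.nsmap.get(None)
rdf = root.nsmap.get('rdf')

# XPath has no notion of a default namespace, so bind it to the prefix 'purl'
for item in root.xpath('purl:item', namespaces={'purl': ns}):
    print(item.attrib.get('{%s}about' % rdf))
    # xpath() returns a list of text nodes, not a single string
    print(item.xpath('purl:description/text()', namespaces={'purl': ns}))
    print('')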

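However, if this is only about parsing RDF, there may be RDF-specific libraries available.

Answer 1: (score: 1)

Since third-party libraries are not an option, here is the same code done with Python's built-in ElementTree: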

from xml.etree import ElementTree as etree

# parse() accepts a filename directly, so no file handle is left open
document = etree.parse('your-example-xml.rdf')
root = document.getroot()

ns_purl = 'http://purl.org/rss/1.0/'
ns_rdf = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# ElementTree addresses namespaced tags and attributes as '{namespace-uri}name'
for item in root.findall('{%s}item' % ns_purl):
    print(item.attrib.get('{%s}about' % ns_rdf))
    print(item.find('{%s}description' % ns_purl).text)
    print('')
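
The question asks for every element inside each <item>, not only the description. A minimal sketch of one way to do that with the standard library alone is shown here. The names `SAMPLE`, `local_name` and `items_to_dicts` are hypothetical helpers introduced for illustration, the sample feed is a shortened stand-in for the data in the question, and the `dc` namespace URI in it is an assumption because the original declaration is cut off.

```python
from xml.etree import ElementTree as etree

NS_PURL = 'http://purl.org/rss/1.0/'
NS_RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

# Hypothetical, shortened sample modelled on the feed in the question.
# The dc namespace URI is assumed; it is truncated in the original post.
SAMPLE = """<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://www.gmanews.tv/">
    <title>GMANews.TV</title>
  </channel>
  <item rdf:about="http://example.com/story/1">
    <dc:format>text/html</dc:format>
    <title><![CDATA[First story]]></title>
    <link>http://example.com/story/1 </link>
  </item>
  <item rdf:about="http://example.com/story/2">
    <dc:format>text/html</dc:format>
    <title><![CDATA[Second story]]></title>
    <link>http://example.com/story/2 </link>
  </item>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def items_to_dicts(xml_text):
    """Return one dict per <item>, mapping child tag names to their text."""
    root = etree.fromstring(xml_text)
    result = []
    for item in root.findall('{%s}item' % NS_PURL):
        record = {'about': item.attrib.get('{%s}about' % NS_RDF)}
        # Iterating an element yields its direct children in document order
        for child in item:
            record[local_name(child.tag)] = (child.text or '').strip()
        result.append(record)
    return result


items = items_to_dicts(SAMPLE)
for record in items:
    print(record)
```

Dropping the namespace keeps the keys short, but two children with the same local name in different namespaces (for example a `dc:title` next to `title`) would overwrite each other. If that can happen in your feed, keep the full `child.tag` as the key instead.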