How to parse XML elements and get their values with Python 2.7

Date: 2016-09-06 22:52:29

Tags: python xml xml-parsing

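API response: http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051

Hello, and thanks for the help yesterday. When I try to get values from the elements, I always get empty results back. Someone commented with a link, but I am not sure I understand it. Where am I going wrong, and why are the results empty?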

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import codecs
    import sys
    import urllib
    import urllib2
    import re, pprint
    from xml.etree.ElementTree import *
    import csv
    from xml.dom import minidom
    import xml.etree.ElementTree as ET
    import shelve
    import subprocess

    errorCheck = "0"
    isbn = raw_input("Enter ISBN Number Please ")
    isIsbn = len(isbn)

    # ElementTree requires namespace definition to work with XML with namespaces correctly
    # It is hardcoded at this point, but this should be constructed from response.
    namespaces = {
      'dc': 'http://purl.org/dc/elements/1.1/',
      'dcndl': 'http://ndl.go.jp/dcndl/terms/',
    }

    # for prefix, uri in namespaces.iteritems():
        # ElementTree.register_namespace(prefix, uri)

    if isIsbn == 10 or isIsbn == 13:
        errorCheck = 1
        url = "http://iss.ndl.go.jp/api/opensearch?isbn=%s" % isbn
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        tree = ET.parse(response)
        root = tree.getroot()
        # root = ET.fromstring(XmlData) 
        print root.findall('dc:title', namespaces)
        print root.findall('dc:title')
        print root.findall('dc:identifier', namespaces)
        print root.findall('dc:identifier')
        print root.findall('identifier')

    if errorCheck == "0":
        print "It is not ISBN"

        # print(root.tag,root.attrib)    

        # for child in root.find('.//item'):
        # print child.text

1 Answer:

Answer 0 (score: 0)

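Your code needs only a small change: add .// to the expressions in your findall calls. The root node is the rss node, and the dc:title elements are descendants of rss, not direct children, so you need to search the whole document: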

    import xml.etree.ElementTree as ET
    import requests

    url = "http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051"
    # fromstring returns the root element (<rss>), not an ElementTree object
    tree = ET.fromstring(requests.get(url).content)
    namespaces = {
        'dc': 'http://purl.org/dc/elements/1.1/',
        'dcndl': 'http://ndl.go.jp/dcndl/terms/',
    }
    # .// searches all descendants of the root, not only its direct children
    print([t.text for t in tree.findall('.//dc:title', namespaces)])
    print([i.text for i in tree.findall('.//dc:identifier', namespaces)])
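
The difference between the two path forms can be reproduced without calling the API. The following is a minimal sketch that assumes a small hand-written snippet (`SAMPLE`, a hypothetical stand-in shaped like the response, not the real payload) and compares a direct-child search with a descendant search:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the API response; not the real payload.
SAMPLE = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <dc:title>Sample Title</dc:title>
      <dc:identifier>0000000000</dc:identifier>
    </item>
  </channel>
</rss>"""

namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
root = ET.fromstring(SAMPLE)

# 'dc:title' only matches direct children of <rss>, so nothing is found.
direct = root.findall('dc:title', namespaces)

# './/dc:title' matches dc:title elements at any depth below <rss>.
nested = root.findall('.//dc:title', namespaces)

print(len(direct))
print([t.text for t in nested])
```

The empty list from the first call is the same symptom as in the question: the namespace map is fine, but the path stops at the children of the root.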

You can also do this easily with lxml, which can map the namespaces for you and fetch the source itself:

    In [1]: import lxml.etree as et

    In [2]: url = "http://iss.ndl.go.jp/api/opensearch?isbn=9784334770051"

    In [3]: tree = et.parse(url)

    In [4]: nsmap = tree.getroot().nsmap

    In [5]: print(tree.xpath("//dc:title/text()", namespaces=nsmap))
    [u'\u9244\u8155\u30a2\u30c8\u30e0']

    In [6]: print(tree.xpath("//dc:identifier/text()", namespaces=nsmap))
    ['4334770053', '95078560']
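
If lxml is not available, a similar namespace map can be collected with the standard library. The sketch below is one possible approach, not part of the original answer: it reads the `start-ns` events from `ET.iterparse`. The helper name `collect_namespaces` and the `SAMPLE` bytes are hypothetical and only stand in for the real response body.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the API response; not the real payload.
SAMPLE = b"""<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:dcndl="http://ndl.go.jp/dcndl/terms/">
  <channel>
    <item>
      <dc:title>Sample Title</dc:title>
    </item>
  </channel>
</rss>"""


def collect_namespaces(source):
    """Return a {prefix: uri} dict built from the namespace declarations in source."""
    found = {}
    # Each 'start-ns' event carries a (prefix, uri) pair.
    for _event, (prefix, uri) in ET.iterparse(source, events=('start-ns',)):
        if prefix:  # skip a default namespace, whose prefix is the empty string
            found[prefix] = uri
    return found


namespaces = collect_namespaces(io.BytesIO(SAMPLE))
root = ET.fromstring(SAMPLE)
titles = [t.text for t in root.findall('.//dc:title', namespaces)]

print(sorted(namespaces))
print(titles)
```

With a live response, the body would need to be read once (for example with `response.read()`) and wrapped in `io.BytesIO`, because the HTTP stream cannot be parsed twice.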

You can see the path to one of the dc:title elements:

    In [55]: tree
    Out[55]: <Element 'rss' at 0x7f996e8b66d0> # root

    In [56]: tree.findall('channel') # child of root so don't need .//
    Out[56]: [<Element 'channel' at 0x7f996e131990>]

    In [57]: tree.findall('channel/item/dc:title', namespaces) # item is a descendant of rss, item is parent of the dc:title
    Out[57]: [<Element '{http://purl.org/dc/elements/1.1/}title' at 0x7f996e131910>]

The same goes for the identifiers:

    In [58]: tree.findall('channel//item//dc:identifier', namespaces)
    Out[58]: 
    [<Element '{http://purl.org/dc/elements/1.1/}identifier' at 0x7f996e131c50>,
     <Element '{http://purl.org/dc/elements/1.1/}identifier' at 0x7f996e131250>]
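
Once the path to each item is known, the values can be grouped per item instead of collected into flat lists. This is a rough sketch, again assuming a hypothetical `SAMPLE` snippet in place of the real response; within each item, dc:title and dc:identifier are direct children, so no .// is needed there:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the API response; not the real payload.
SAMPLE = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <dc:title>Sample Title</dc:title>
      <dc:identifier>0000000000</dc:identifier>
      <dc:identifier>11111111</dc:identifier>
    </item>
  </channel>
</rss>"""

namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
root = ET.fromstring(SAMPLE)

records = []
# channel is a child of rss and item is a child of channel,
# so an explicit path from the root works here.
for item in root.findall('channel/item'):
    records.append({
        # findtext returns the text of the first match, or the default if none
        'title': item.findtext('dc:title', default='', namespaces=namespaces),
        'identifiers': [i.text for i in item.findall('dc:identifier', namespaces)],
    })

print(records)
```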