Parsing XML to get a node's value

Time: 2012-08-03 07:10:26

Tags: python xml

import xml.dom.minidom

content = """
<urlset xmlns="http://www.google.com/schemas/sitemap/0.90">
  <url>
    <loc>http://www.domain.com/</loc>
    <lastmod>2011-01-27T23:55:42+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.domain.com/page1.html</loc>
    <lastmod>2011-01-26T17:24:27+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>  
  <url>
    <loc>http://www.domain.com/page2.html</loc>
    <lastmod>2011-01-26T15:35:07+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>  
</urlset>
"""

# Named "doc" rather than "xml" so the xml module is not shadowed.
doc = xml.dom.minidom.parseString(content)
urlset = doc.getElementsByTagName("urlset")[0]
url = urlset.getElementsByTagName("url")

for i in range(0, url.length):
    loc = url[i].getElementsByTagName("loc")[0].childNodes[0].nodeValue
    lastmod = url[i].getElementsByTagName("lastmod")[0].childNodes[0].nodeValue
    changefreq = url[i].getElementsByTagName("changefreq")[0].childNodes[0].nodeValue
    priority = url[i].getElementsByTagName("priority")[0].childNodes[0].nodeValue
    print("%s, %s, %s, %s" % (loc, lastmod, changefreq, priority))

Is there a simpler way to get a node's value than this?

loc = url[i].getElementsByTagName("loc")[0].childNodes[0].nodeValue

4 answers:

Answer 0: (score: 0)

Does this work? loc = getElementsByTagName("loc")[i].innerHTML
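
For reference, minidom nodes do not expose an `innerHTML` attribute; that property belongs to the browser DOM. A minimal sketch of the nearest equivalents follows, using a shortened one-element document made up for illustration:

```python
import xml.dom.minidom

# Shortened sample document (an assumption, not the full sitemap above).
doc = xml.dom.minidom.parseString("<url><loc>http://www.domain.com/</loc></url>")
loc = doc.getElementsByTagName("loc")[0]

# minidom elements have no innerHTML attribute.
has_inner_html = hasattr(loc, "innerHTML")

# The text content lives in the element's first child, a text node.
text = loc.firstChild.data

# toxml() serializes the whole element, tags included.
markup = loc.toxml()

print(has_inner_html)
print(text)
print(markup)
```

So `firstChild.data` (or `firstChild.nodeValue`) is what returns the bare text, while `toxml()` is closer to an "outer" serialization.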

Answer 1: (score: 0)

There may be a better way to get a node's value, but this is at least a cleaner option that avoids repeating yourself:

import xml.dom.minidom

content = """
<urlset xmlns="http://www.google.com/schemas/sitemap/0.90">
  <url>
    <loc>http://www.domain.com/</loc>
    <lastmod>2011-01-27T23:55:42+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.domain.com/page1.html</loc>
    <lastmod>2011-01-26T17:24:27+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>  
  <url>
    <loc>http://www.domain.com/page2.html</loc>
    <lastmod>2011-01-26T15:35:07+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>  
</urlset>
"""

def get_first_node_val(obj, tag):
    return obj.getElementsByTagName(tag)[0].childNodes[0].nodeValue

# Named "doc" rather than "xml" so the xml module is not shadowed.
doc = xml.dom.minidom.parseString(content)
urlset = doc.getElementsByTagName("urlset")[0]
urls = urlset.getElementsByTagName("url")

for url in urls:
    loc = get_first_node_val(url, "loc")
    lastmod = get_first_node_val(url, "lastmod")
    changefreq = get_first_node_val(url, "changefreq")
    priority = get_first_node_val(url, "priority")
    print("%s, %s, %s, %s" % (loc, lastmod, changefreq, priority))
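
As a possible alternative to minidom (not something proposed in this thread), the standard library's `xml.etree.ElementTree` can read an element's text with `findtext`. The sketch below assumes a one-entry version of the sitemap; the `sm` prefix and the `NS` and `FIELDS` names are made up for illustration. Because the document declares a default namespace, every tag has to be looked up with that namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix mapping for the sitemap's default namespace.
NS = {"sm": "http://www.google.com/schemas/sitemap/0.90"}
FIELDS = ("loc", "lastmod", "changefreq", "priority")

content = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.90">
  <url>
    <loc>http://www.domain.com/</loc>
    <lastmod>2011-01-27T23:55:42+01:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>"""

root = ET.fromstring(content)

rows = []
for url in root.findall("sm:url", NS):
    # findtext returns the text of the first matching child, or the default.
    rows.append(tuple(url.findtext("sm:" + tag, default="", namespaces=NS)
                      for tag in FIELDS))

for row in rows:
    print(", ".join(row))
```

The trade-off is that the namespace must be spelled out on every lookup, whereas minidom's `getElementsByTagName` matches on the plain tag name.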

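Answer 2: (score: 0)

Why not use firstChild? Note that getElementsByTagName returns a list of nodes, so it still has to be indexed first:

loc = url[i].getElementsByTagName("loc")[0].firstChild.nodeValue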
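
One caveat with `firstChild`: it is `None` for an empty element such as `<lastmod/>`, so reading `.nodeValue` from it raises `AttributeError`. A minimal sketch of a guarded lookup, where `first_text` is a hypothetical helper name and the sample document is made up for illustration:

```python
import xml.dom.minidom

def first_text(parent, tag, default=None):
    """Return the text of the first <tag> under parent, or default if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if len(nodes) == 0 or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.nodeValue

# Sample with one populated, one empty, and one missing element.
doc = xml.dom.minidom.parseString(
    "<url><loc>http://www.domain.com/</loc><lastmod/></url>")
url = doc.documentElement

loc = first_text(url, "loc")
lastmod = first_text(url, "lastmod")
priority = first_text(url, "priority", "n/a")
print("%s, %s, %s" % (loc, lastmod, priority))
```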

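Answer 3: (score: 0)

This extends "get_first_node_val" so that it handles XML in which the same tag appears more than once. For example, the following contains two loc elements.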

<url>
<loc>http://domain.com/</loc>
<loc>http://sub.domain.com</loc>
<lastmod>2011-01-27T23:55:42+01:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>


def get_first_node_val(obj, tag):
  # Despite the name, this collects the text of every <tag> element under obj,
  # returning a list of {tag: value} dicts.
  elements = []
  for node in obj.getElementsByTagName(tag):
    elements.append({tag: node.childNodes[0].nodeValue})
  return elements

Output

[{'loc': u'http://domain.com/'}, {'loc': u'http://sub.domain.com'}], [{'lastmod': u'2011-01-27T23:55:42+01:00'}], [{'changefreq': u'daily'}], [{'priority': u'0.5'}]
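
If plain strings are more convenient than one-key dicts, the same idea can be sketched as a list comprehension. `get_all_node_vals` is a hypothetical name, and the snippet assumes every matched element has text content:

```python
import xml.dom.minidom

content = """<url>
<loc>http://domain.com/</loc>
<loc>http://sub.domain.com</loc>
<lastmod>2011-01-27T23:55:42+01:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>"""

def get_all_node_vals(obj, tag):
    # Text of every <tag> element under obj, in document order.
    return [node.firstChild.nodeValue for node in obj.getElementsByTagName(tag)]

url = xml.dom.minidom.parseString(content).documentElement
locs = get_all_node_vals(url, "loc")
lastmods = get_all_node_vals(url, "lastmod")
print(locs)
print(lastmods)
```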