迭代XML文件并从中提取数据

时间:2012-04-28 11:30:39

标签: python

我有这个XML文件:

<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>

如何迭代文件为每部电影提取其特定数据?我的意思是电影0:它的名字,地址,年份等等。

输入文件结构是必需的,因此可以在循环时完成数据提取。

非常感谢。

3 个答案:

答案 0 :(得分:3)

您需要查看xml.etree.ElementTree.

我还注意到你所拥有的不是有效的XML,所以你可能会遇到问题。有效的XML可能看起来更像这样:

<movie id="0"> 
  <name>The Shawshank Redemption</name> 
  <url>http://www.imdb.com/title/tt0111161/</url> 
  <year>1994</year> 
  <stars>
    <star>Tim Robbins</star>
    <star>Morgan Freeman</star>
    <star>Bob Gunton</star>
  </stars> 
  <plot>plot...</plot> 
  <keywords>
    <keyword>Reviews</keyword>
    <keyword>Showtimes</keyword>
  </keywords>
</movie>

请注意小写标记名称和属性(<movieNum = 0>没有意义)。您还需要一个顶部的XML声明(如<?xml version="1.0" encoding="UTF-8" ?>)。您可以在XML Validation或使用xmllint验证您的XML,例如。

一旦有了有效的XML,就可以解析它并使用iterparse()对其进行迭代,或者解析它然后迭代构造的元素树。

答案 1 :(得分:3)

编辑 - 接受改进的XML输入

我强烈建议您尝试在@Lattyware的评论中验证您的输入。我发现使用无效的XML和HTML,BeautifulSoup可以很好地恢复可用的东西。以下是快速尝试的功能:

from BeautifulSoup import BeautifulSoup

# Note: I have added the <movielist> root element
xml = """<movielist>
<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movieNum>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movieNum>

</movielist>"""

soup = BeautifulSoup(xml)
movies = soup.findAll('movie')

for movie in movies:
    id_tag = movie['id']
    name = movie.find("movie_name").text
    url = movie.find("address").text
    year = movie.find("year").text
    stars = movie.find("stars").text
    plot = movie.find("plot").text
    keywords = movie.find("keywords").text
    for item in (id_tag, name, url, year, stars, plot, keywords):
        print item
    print '=' * 50

这将输出以下内容(ID标签现在可以访问):

0
The Shawshank Redemption
http://www.imdb.com/title/tt0111161/
1994
Tim Robbins  Morgan Freeman  Bob Gunton
plot...
Reviews, Showtimes
==================================================
1
Inglourious Basterds
http://www.imdb.com/title/tt0361748/
2009
Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz
plot/...
Reviews, credits
==================================================

希望能给你一个开始......它只会从这里变得更好。

答案 2 :(得分:2)

BeutifulSoup更宽容,它也可以用于HTML(其中一些封闭标签是可选的)。仅当XML有效时才能使用ElementTree。您可以通过将片段包装到单个元素来使其部分有效。属性值必须用引号括起来。尝试以下方法,其中创建Movie类以从一个电影元素捕获信息。该类源自dict,与dict一样灵活;但是,您可以创建自己的方法以从收集的信息中返回已处理的值:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

class Movie(dict):

    def __init__(self, movie_element):
        assert movie_element.tag == 'movie'  # we are able to process only that
        self['id'] = movie_element.attrib['id']  
        for e in movie_element:
            self[e.tag] = e.text.strip()

    def name(self):
        return self['Movie_name']

    def url(self):
        return self['Address']

    def year(self):
        return self['year']     

    def stars(self):
        return self['stars']

    def plot(self):
        return self['plot']

    def keywords(self):
        return self['keywords']

    def __str__(self):
        lst = []
        lst.append(self.name() + ' (' + self.year() + ')')
        lst.append(self.stars())
        lst.append(self.url())
        return '\n'.join(lst)


fragment = '''\
<movie id = "0"> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = "1"> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  Melanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>
'''

fixed_fragment = '<root>\n' + fragment + '</root>'
##print fixed_fragment

tree = ET.fromstring(fixed_fragment)
movies = []
for m in tree:
    movies.append(Movie(m))

for movie in movies:
    print '\n------------------'
    print movie    

它在我的控制台上打印:

------------------
The Shawshank Redemption (1994)
Tim Robbins  Morgan Freeman  Bob Gunton
http://www.imdb.com/title/tt0111161/

------------------
Inglourious Basterds (2009)
Brad Pitt  Melanie Laurent  Christoph Waltz
http://www.imdb.com/title/tt0361748/

请注意,我已经替换了非ASCII字符 - 编码问题需要单独解决。