I have this XML file:
<movie id = 0>
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = 1>
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
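How can I iterate over the file and extract the data for each movie? I mean, for movie 0: its name, address, year, and so on.
The input file structure is a given, so the data has to be extracted while looping over it.
Thanks a lot.
Answer 0 (score: 3)
Note that what you have is not valid XML, so you are likely to run into problems. Valid XML would look more like this: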
<movie id="0">
<name>The Shawshank Redemption</name>
<url>http://www.imdb.com/title/tt0111161/</url>
<year>1994</year>
<stars>
<star>Tim Robbins</star>
<star>Morgan Freeman</star>
<star>Bob Gunton</star>
</stars>
<plot>plot...</plot>
<keywords>
<keyword>Reviews</keyword>
<keyword>Showtimes</keyword>
</keywords>
</movie>
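Note the lowercase tag names and the attribute (<movieNum = 0> is not meaningful XML). You should also put an XML declaration at the top (such as <?xml version="1.0" encoding="UTF-8" ?>); it is optional in XML 1.0 but recommended, and a complete document additionally needs a single root element around the movie elements. You can validate your XML with an online XML validation service or with xmllint, for example.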
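As a quick local alternative to the tools mentioned above, here is a minimal sketch of a well-formedness check using Python's standard xml.etree.ElementTree. The helper name is_well_formed is made up for this illustration, and the check only tells you whether the parser accepts the text; it does not validate against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message).

    Hypothetical helper for illustration; it checks well-formedness only.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


# Unquoted attribute value, as in the question: rejected by the parser.
broken = '<movie id = 0><year>1994</year></movie>'
# Quoted attribute value: accepted.
fixed = '<movie id="0"><year>1994</year></movie>'

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

The first call reports a parse error because the attribute value is not quoted; the second one succeeds.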
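Once you have valid XML, you can either iterate over it while it is being parsed, using iterparse(), or parse it first and then iterate over the resulting element tree.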
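A minimal sketch of the iterparse() approach, assuming Python 3 and the restructured layout shown above. The <movies> root element, the sample data held in a string, and the function name iter_movies are assumptions made for this example; with a real file you would pass the file name or an open file object to iterparse() instead of the BytesIO wrapper.

```python
import io
import xml.etree.ElementTree as ET

# Sample data for illustration; the <movies> root element is an assumption.
xml_data = """<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="0">
    <name>The Shawshank Redemption</name>
    <url>http://www.imdb.com/title/tt0111161/</url>
    <year>1994</year>
    <stars>
      <star>Tim Robbins</star>
      <star>Morgan Freeman</star>
      <star>Bob Gunton</star>
    </stars>
  </movie>
  <movie id="1">
    <name>Inglourious Basterds</name>
    <url>http://www.imdb.com/title/tt0361748/</url>
    <year>2009</year>
    <stars>
      <star>Brad Pitt</star>
      <star>Mélanie Laurent</star>
      <star>Christoph Waltz</star>
    </stars>
  </movie>
</movies>
"""


def iter_movies(source):
    """Yield one dict per <movie> element as soon as it has been fully parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "movie":
            yield {
                "id": elem.get("id"),
                "name": elem.findtext("name"),
                "url": elem.findtext("url"),
                "year": elem.findtext("year"),
                "stars": [star.text for star in elem.findall("stars/star")],
            }
            elem.clear()  # drop the children we no longer need


movies = list(iter_movies(io.BytesIO(xml_data.encode("utf-8"))))
for movie in movies:
    print(movie["id"], movie["name"], movie["year"], ", ".join(movie["stars"]))
```

Handling only the "end" event means each <movie> element is complete, including its children, by the time it is read. For a small file, ET.parse() followed by a plain loop over the root element's children would do the same job.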
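Answer 1 (score: 3)
Edit: updated to accept the improved XML input
I strongly recommend validating your input, as suggested in @Lattyware's comment. In my experience, BeautifulSoup does a good job of recovering something usable from invalid XML and HTML. Here is a quick attempt: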
from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

# Note: I have added the <movielist> root element
xml = """<movielist>
<movie id = 0>
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = 1>
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
</movielist>"""

# The lenient HTML parser tolerates the unquoted attribute values
# and lowercases all tag names, hence "movie_name" and "address" below.
soup = BeautifulSoup(xml, "html.parser")
movies = soup.find_all('movie')
for movie in movies:
    id_tag = movie['id']
    # strip() removes the stray spaces and newlines around each value
    name = movie.find("movie_name").text.strip()
    url = movie.find("address").text.strip()
    year = movie.find("year").text.strip()
    stars = movie.find("stars").text.strip()
    plot = movie.find("plot").text.strip()
    keywords = movie.find("keywords").text.strip()
    for item in (id_tag, name, url, year, stars, plot, keywords):
        print(item)
    print('=' * 50)
This outputs the following (the id attribute is now accessible):
0
The Shawshank Redemption
http://www.imdb.com/title/tt0111161/
1994
Tim Robbins Morgan Freeman Bob Gunton
plot...
Reviews, Showtimes
==================================================
1
Inglourious Basterds
http://www.imdb.com/title/tt0361748/
2009
Brad Pitt Mélanie Laurent Christoph Waltz
plot/...
Reviews, credits
==================================================
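Hopefully that gives you a start... it only gets better from here.
Answer 2 (score: 2)
BeautifulSoup is more forgiving, and it also works for HTML (where some closing tags are optional). ElementTree can be used only when the XML is well-formed. You can make your fragment well-formed by wrapping it in a single root element, and attribute values must be enclosed in quotes. Try the following, where a Movie class is created to capture the information from one movie element. The class is derived from dict, so it is as flexible as a dict; however, you can add your own methods that return processed values from the collected information: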
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET


class Movie(dict):
    """Dictionary holding the data of one <movie> element."""

    def __init__(self, movie_element):
        assert movie_element.tag == 'movie'  # only <movie> elements can be processed
        self['id'] = movie_element.attrib['id']
        for e in movie_element:
            # The child tag becomes the key; surrounding whitespace is dropped.
            self[e.tag] = (e.text or '').strip()

    def name(self):
        return self['Movie_name']

    def url(self):
        return self['Address']

    def year(self):
        return self['year']

    def stars(self):
        return self['stars']

    def plot(self):
        return self['plot']

    def keywords(self):
        return self['keywords']

    def __str__(self):
        lst = []
        lst.append(self.name() + ' (' + self.year() + ')')
        lst.append(self.stars())
        lst.append(self.url())
        return '\n'.join(lst)

fragment = '''\
<movie id = "0">
<Movie_name>The Shawshank Redemption </Movie_name>
<Address>http://www.imdb.com/title/tt0111161/
</Address>
<year>1994 </year>
<stars>Tim Robbins Morgan Freeman Bob Gunton </stars>
<plot> plot...
</plot>
<keywords>Reviews, Showtimes</keywords>
</movie>
<movie id = "1">
<Movie_name>Inglourious Basterds </Movie_name>
<Address>http://www.imdb.com/title/tt0361748/
</Address>
<year>2009 </year>
<stars>Brad Pitt Melanie Laurent Christoph Waltz </stars>
<plot>plot/...
</plot>
<keywords>Reviews, credits </keywords>
</movie>
'''
# Wrap the fragment in a single root element so that it is well-formed.
fixed_fragment = '<root>\n' + fragment + '</root>'
# print(fixed_fragment)
tree = ET.fromstring(fixed_fragment)
movies = []
for m in tree:
    movies.append(Movie(m))
for movie in movies:
    print('\n------------------')
    print(movie)
It prints the following on my console:
------------------
The Shawshank Redemption (1994)
Tim Robbins Morgan Freeman Bob Gunton
http://www.imdb.com/title/tt0111161/
------------------
Inglourious Basterds (2009)
Brad Pitt Melanie Laurent Christoph Waltz
http://www.imdb.com/title/tt0361748/
Note that I replaced the non-ASCII character; encoding issues have to be solved separately.
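On the encoding point, a minimal sketch assuming Python 3, where strings are Unicode and ElementTree accepts UTF-8 encoded bytes. The <root> wrapper and the variable names are made up for this illustration; your own data may be stored in a different encoding, in which case the declaration and the encode() call would have to match it.

```python
import xml.etree.ElementTree as ET

# Assumption: the data is stored as UTF-8 and says so in the XML declaration.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<root><stars>Brad Pitt Mélanie Laurent Christoph Waltz </stars></root>'
).encode("utf-8")

root = ET.fromstring(raw)  # bytes in, Unicode text out
stars = root.findtext("stars").strip()
print(stars)
```

Under these assumptions the accented character survives the round trip, so the name would not need to be replaced with an ASCII approximation.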