I recently started learning Python and I am trying to parse an XML document. Consider the following XML file for reference:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>
Here I want to retrieve the first book tag together with all of its content, i.e.:
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
I come from a Scala background, where I can do this easily with:
val node = scala.xml.XML.loadString(str)
val nodeSeq = node \\ "book"
nodeSeq.head.toString()
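I have tried to do this with lxml and xpath, but it gets complicated (recursively collecting the content of nested elements) to reach the result above. Is there a simple way to do this in Python? Can it also be extended to HTML?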
TIA
Answer 0 (score: 1)
Using lxml and xpath:
from lxml import etree
data = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>"""
tree = etree.fromstring(data)
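book = tree.xpath("//catalog/book")  # or: book = tree.xpath("(//catalog/book)[1]")
# book[0] is the first <book>; serialize the element itself so that the
# <book> tag and all of its children are included in the output
print(etree.tostring(book[0], encoding="unicode"))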
Output:
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
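If lxml is not installed, the same idea can be sketched with the standard library's xml.etree.ElementTree instead (a swapped-in module, not what this answer uses). The shortened catalog and the names first_book and first_book_xml below are assumptions made for illustration only:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the question (assumption: two fields per book)
data = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
</book>
</catalog>"""

root = ET.fromstring(data)        # root is the <catalog> element
first_book = root.find("book")    # first direct <book> child, or None if absent

# Serializing the element itself keeps the <book> tag and all nested children;
# strip() drops the whitespace that follows </book> in the source
first_book_xml = ET.tostring(first_book, encoding="unicode").strip()
print(first_book_xml)
```

This roughly mirrors nodeSeq.head.toString() from the Scala snippet: find the first matching element, then serialize it whole rather than walking its children.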
Answer 1 (score: 0)
Here is the XPath that extracts only the first book:
//catalog/book[1]
Here is the complete code that returns the desired result:
from lxml import html
XML = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>"""
tree = html.fromstring(XML)
first_book = tree.xpath('//catalog/book[1]')[0]
book_id = first_book.xpath('@id')[0]
author = first_book.xpath('.//author/text()')[0]
title = first_book.xpath('.//title/text()')[0]
genre = first_book.xpath('.//genre/text()')[0]
price = first_book.xpath('.//price/text()')[0]
publish_date = first_book.xpath('.//publish_date/text()')[0]
# collapse the line break and any repeated whitespace into single spaces
description = ' '.join(first_book.xpath('.//description/text()')[0].split())
print("""Book Id:\t\t{}
Author:\t\t\t{}
Title:\t\t\t{}
Genre:\t\t\t{}
Price:\t\t\t{}
Publish Date:\t{}
Description:\t{}""".format(book_id, author, title, genre, price, publish_date, description))
Output:
Book Id: bk101
Author: Gambardella, Matthew
Title: XML Developer's Guide
Genre: Computer
Price: 44.95
Publish Date: 2000-10-01
Description: An in-depth look at creating applications with XML.
If you need the same information from every book inside <catalog>, just change //catalog/book[1] to //catalog/book and then loop over the results to extract the field data of each book.
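The loop described above might look like the following minimal sketch. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementTree rather than lxml; the calls it relies on (fromstring, findall, findtext, get) are also available on lxml.etree elements. The helper book_to_dict, the FIELDS tuple, and the shortened catalog are hypothetical names and data chosen for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog from the question (assumption for brevity)
data = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<price>44.95</price>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<price>5.95</price>
<description>A former architect battles corporate zombies.</description>
</book>
</catalog>"""

# Child tags to read from every <book> (hypothetical selection)
FIELDS = ("author", "title", "price", "description")


def book_to_dict(book):
    """Collect the id attribute and the text of each field of one <book>."""
    record = {"id": book.get("id")}
    for name in FIELDS:
        text = book.findtext(name, default="")
        # collapse line breaks and repeated whitespace into single spaces
        record[name] = " ".join(text.split())
    return record


root = ET.fromstring(data)
books = [book_to_dict(book) for book in root.findall("book")]

for record in books:
    print(record)
```

Building one dictionary per book keeps the extraction logic in a single place, so adding another field only means extending FIELDS.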