My large XML string looks like this:
<RecordContainer RecordNumber = "1">
<catalog>
<Person id="bk101">
<person>
<author>Gambardella, Matthew</author>
<personal_info>
<age>40</age>
<DOB>19-02-1988</DOB>
</personal_info>
</person>
<books>
<book_id>1</book_id>
<title>XML Developer's Guide</title>
<price>44.95</price>
<publish>
<publish_date>2000-10-01</publish_date>
<info>this is the guide to XML</info>
</publish>
</books>
<books>
<book_id>2</book_id>
<title>Python for beginners</title>
<price>21.50</price>
<publish>
<publish_date>2002-005-5</publish_date>
<info>this is the guide to Python</info>
</publish>
</books>
</Person>
</catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
<catalog>
<Person id="bk102">
<person>
<author>Ralls, Kim</author>
<personal_info>
<age>29</age>
<DOB>11-05-1994</DOB>
</personal_info>
</person>
<books>
<book_id>1</book_id>
<title>Scala Programming</title>
<price>15.90</price>
<publish>
<publish_date>2011-04-10</publish_date>
<info>this is the guide to Scala Programming</info>
</publish>
</books>
<books>
<book_id>2</book_id>
<title>PySpark for beginners</title>
<price>25.50</price>
<publish>
<publish_date>2012-07-21</publish_date>
<info>PySpark Guide</info>
</publish>
</books>
</Person>
</catalog>
</RecordContainer>
My expected output is a pandas DataFrame built from selected tags, with one row per record and the following columns:
Record_Number PersonID author DOB book_id1 title1 publish_date1 book_id2 title2 publish_date2
I tried using .find(".//element"), but I could not access each element individually. To parse the file, I used the following code:
from lxml import etree
tree = etree.fromstring("<root>"+input_data+"</root>")
After parsing with the code above, I tried to get each element's tag and text with .find(), but no text was displayed.
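For reference, here is a minimal sketch of the kind of flattening I am after. It uses the standard-library xml.etree.ElementTree rather than lxml (the find, findall, findtext and get calls used here should behave the same way in lxml.etree). The function name records_to_rows and the trimmed sample are hypothetical, and the sketch assumes the input is well-formed: the book id tag is written as <book_id> and <Person> has a closing tag, since an element name containing a space or an unclosed element makes the parser raise an error instead of returning a tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down input. Assumption: tags are well-formed
# (<book_id> rather than a name with a space, and <Person> is closed).
input_data = """
<RecordContainer RecordNumber="1">
  <catalog>
    <Person id="bk101">
      <person>
        <author>Gambardella, Matthew</author>
        <personal_info><age>40</age><DOB>19-02-1988</DOB></personal_info>
      </person>
      <books>
        <book_id>1</book_id>
        <title>XML Developer's Guide</title>
        <publish><publish_date>2000-10-01</publish_date></publish>
      </books>
      <books>
        <book_id>2</book_id>
        <title>Python for beginners</title>
        <publish><publish_date>2002-005-5</publish_date></publish>
      </books>
    </Person>
  </catalog>
</RecordContainer>
<RecordContainer RecordNumber="2">
  <catalog>
    <Person id="bk102">
      <person>
        <author>Ralls, Kim</author>
        <personal_info><age>29</age><DOB>11-05-1994</DOB></personal_info>
      </person>
      <books>
        <book_id>1</book_id>
        <title>Scala Programming</title>
        <publish><publish_date>2011-04-10</publish_date></publish>
      </books>
      <books>
        <book_id>2</book_id>
        <title>PySpark for beginners</title>
        <publish><publish_date>2012-07-21</publish_date></publish>
      </books>
    </Person>
  </catalog>
</RecordContainer>
"""


def records_to_rows(xml_fragments):
    """Return one flat dict per <RecordContainer> element."""
    # Several top-level elements do not form a valid document,
    # so wrap them in a single synthetic root before parsing.
    root = ET.fromstring("<root>" + xml_fragments + "</root>")
    rows = []
    for record in root.findall("RecordContainer"):
        # ".//" searches all descendants of the current record only.
        person = record.find(".//Person")
        row = {
            "Record_Number": record.get("RecordNumber"),
            "PersonID": person.get("id") if person is not None else None,
            "author": record.findtext(".//author"),
            "DOB": record.findtext(".//DOB"),
        }
        # Number the repeated <books> blocks to get book_id1, book_id2, ...
        for n, book in enumerate(record.findall(".//books"), start=1):
            row[f"book_id{n}"] = book.findtext("book_id")
            row[f"title{n}"] = book.findtext("title")
            row[f"publish_date{n}"] = book.findtext("publish/publish_date")
        rows.append(row)
    return rows


rows = records_to_rows(input_data)
```

Passing the result to pandas.DataFrame(rows) should then give one row per record with the columns listed above. Looping over each RecordContainer and searching relative to it keeps values from different records from being mixed together, which a single .find() on the root cannot do because it returns only the first match.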