How to convert an XML string to a DataFrame using only selected tags and their values in Python

Asked: 2019-10-06 13:18:17

Tags: python xml pandas dataframe

A sample of my large XML string looks like this:

<RecordContainer RecordNumber = "1">
<catalog>
   <Person id="bk101">
      <person>
         <author>Gambardella, Matthew</author>
         <personal_info>
            <age>40</age>
            <DOB>19-02-1988</DOB>
         </personal_info> 
      </person>
      <books>
          <book id>1</book id>
          <title>XML Developer's Guide</title>
          <price>44.95</price>
          <publish>
             <publish_date>2000-10-01</publish_date>
             <info>this is the guide to XML</info>
          </publish>
      </books>
      <books>
          <book id>2</book id>
          <title>Python for beginners</title>
          <price>21.50</price>
          <publish>
             <publish_date>2002-005-5</publish_date>
             <info>this is the guide to Python</info>
          </publish>
      </books>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <Person id="bk102">
      <person>
        <author>Ralls, Kim</author>
        <personal_info>
            <age>29</age>
            <DOB>11-05-1994</DOB>
         </personal_info> 
      </person>
      <books>
          <book id>1</book id>
          <title>Scala Prgramming</title>
          <price>15.90</price>
          <publish>
             <publish_date>2011-04-10</publish_date>
             <info>this is the guide to Scala Programming</info>
          </publish>
      </books>
      <books>
          <book id>2</book id>
          <title>PySpark for beginners</title>
          <price>25.50</price>
          <publish>
             <publish_date>2012-07-21</publish_date>
             <info>PySpark Guide</info>
          </publish>
      </books>
 </catalog>
</RecordContainer>

My expected output is a pandas DataFrame that contains only the selected tags and their values, with columns like these:

Record_Number  PersonID  author   DOB   book_id1  title1   publish_date1  book_id2  title2  publish_date2
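
A minimal sketch of one way to reach the column layout above, using only the standard-library xml.etree.ElementTree. It assumes a cleaned-up, shortened copy of the sample in which <book id> is renamed to <book_id> and the missing </Person> closing tags are added, because the XML as posted is not well-formed. The names SAMPLE and records_to_rows are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: cleaned-up, shortened copy of the posted sample.
# <book id> is renamed to <book_id> and </Person> is added.
SAMPLE = """
<RecordContainer RecordNumber="1">
 <catalog>
  <Person id="bk101">
   <person>
    <author>Gambardella, Matthew</author>
    <personal_info><age>40</age><DOB>19-02-1988</DOB></personal_info>
   </person>
   <books>
    <book_id>1</book_id>
    <title>XML Developer's Guide</title>
    <publish><publish_date>2000-10-01</publish_date></publish>
   </books>
   <books>
    <book_id>2</book_id>
    <title>Python for beginners</title>
    <publish><publish_date>2002-005-5</publish_date></publish>
   </books>
  </Person>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber="2">
 <catalog>
  <Person id="bk102">
   <person>
    <author>Ralls, Kim</author>
    <personal_info><age>29</age><DOB>11-05-1994</DOB></personal_info>
   </person>
   <books>
    <book_id>1</book_id>
    <title>Scala Prgramming</title>
    <publish><publish_date>2011-04-10</publish_date></publish>
   </books>
   <books>
    <book_id>2</book_id>
    <title>PySpark for beginners</title>
    <publish><publish_date>2012-07-21</publish_date></publish>
   </books>
  </Person>
 </catalog>
</RecordContainer>
"""


def records_to_rows(xml_text):
    """Return one flat dict per <RecordContainer> (hypothetical helper)."""
    # Several top-level elements do not form a single document, so wrap them.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    rows = []
    for record in root.iter("RecordContainer"):
        person = record.find(".//Person")
        row = {
            "Record_Number": record.get("RecordNumber"),  # attribute value
            "PersonID": person.get("id"),
            "author": person.findtext(".//author"),       # element text
            "DOB": person.findtext(".//DOB"),
        }
        # Number the repeated <books> children to get book_id1, title1, ...
        for n, book in enumerate(person.findall("books"), start=1):
            row[f"book_id{n}"] = book.findtext("book_id")
            row[f"title{n}"] = book.findtext("title")
            row[f"publish_date{n}"] = book.findtext("publish/publish_date")
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE)
# With pandas installed, the rows can be handed over as: pd.DataFrame(rows)
```

Passing rows to pandas.DataFrame should give one row per record. A record with fewer books than the others would simply have missing values in the extra columns.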

I tried using .find(".//element"), but I could not access each element individually. To parse the input, I used the following code:

from lxml import etree
tree = etree.fromstring("<root>"+input_data+"</root>")

After parsing with the code above, I tried to get each element's tag and text with .find(), but no text was displayed.
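
One possible cause, offered as an assumption because the failing .find() call is not shown: the sample as posted is not well-formed XML. A tag name cannot contain a space, so <book id> is read as a tag named book with a valueless attribute id, which a strict parser rejects. The </Person> closing tags are also missing. The sketch below uses the standard-library xml.etree.ElementTree rather than lxml to check a fragment before querying it. The helper name is_well_formed and the two fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses once wrapped in a single root."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


# Fragment written the way the sample spells it, with a space in the tag.
as_posted = "<books><book id>1</book id><title>XML Developer's Guide</title></books>"
# Assumed fix: rename the tag so it contains no space.
renamed = "<books><book_id>1</book_id><title>XML Developer's Guide</title></books>"

posted_ok = is_well_formed(as_posted)
renamed_ok = is_well_formed(renamed)
print(posted_ok, renamed_ok)
```

Once the input parses, note that find() returns an element object (or None when nothing matches), so the text has to be read from its .text attribute or fetched directly with findtext().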

0 answers:

No answers yet