Converting XML to CSV with xml.etree.ElementTree.iterparse

Time: 2017-11-01 18:09:08

Tags: python xml csv

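Guys, I'm new to Python (brand new), so after taking an elective course I decided to write a script that converts an XML file to CSV format. The file in question is 2 GB, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse function. For reference, the XML file I want to convert looks like this: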

<Document>
  <type></type>
  <internal_id></internal_id>
  <name></name>
  <number></number>
  <cadname></cadname>
  <version></version>
  <iteration></iteration>
  **<isLatest></isLatest>**
  <modifiedBy>
     <username></username>
     <email/>
  </modifiedBy>
  <content>
     **<name></name>**
     <id></id>
     <uploaded></uploaded>
     <refSize></refSize>
     <storage>
        <vault></vault>
        <folder></folder>
        **<filename></filename>**
        <location></location>
        **<actualLocation></actualLocation>**
     </storage>
     <replicatedTo></replicatedTo>
     <copies></copies>
     <status></status>
  </content>

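I use the value of isLatest to decide whether an item needs to be added to the CSV file. If the value is "true", I want the data to go into the CSV file. Here is the code, which works up to a point: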

import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml")

# open a file for writing

csvfile = open('windchill.txt', 'w', encoding="utf-8")

# create the csv writer object

csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:

    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
            document.clear()
csvfile.close()

If I run the code, I get this error:

Traceback (most recent call last):
  File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'

It creates a file with 91,000 entries that looks like this:

plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7

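There appear to be two problems with the output.

  1. Some items appear to be duplicated, although the source file has no duplicates. Names can repeat in the source file, but there can be only one name value.
  2. I don't think the file is complete. I looked at the last entry in the TXT (CSV) file, and it does not match the last entry of my source file. I am assuming the iterator is serial in nature.

So, any ideas what the error is telling me, and why I might be seeing duplicates? Initially I thought the error might be related to my not ending gracefully. I believe the whole XML is well-formed, but maybe that is a bad assumption.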

    UPDATES

    Here is a sample of the elements.

    <Document>
      <type>wt.epm.EPMDocument</type>
      <internal_id>33709881</internal_id>
      <name>bga_13x11p137_0_4_0_8.prt</name>
      <number>BGA_13X11P137_0_4_0_8.PRT</number>
      <cadname>bga_13x11p137_0_4_0_8.prt</cadname>
      <version>A</version>
      <iteration>1</iteration>
      <isLatest>false</isLatest>
      <modifiedBy>
         <username>ets027 (deleted)</username>
         <email/>
      </modifiedBy>
      <content>
         <name>bga_13x11p137_0_4_0_8.prt</name>
         <id>5341368</id>
         <uploaded>Jan 13, 2006 09:14:41</uploaded>
         <refSize>287764</refSize>
         <storage>
            <vault>master_vault</vault>
            <folder>master_vault7</folder>
            <filename>000000000505a6</filename>
            <location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
            <actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
         </storage>
         <replicatedTo>
         </replicatedTo>
         <copies>
         </copies>
         <status>Content File Missing</status>
      </content>
    </Document>
    <Document>
      <type>wt.epm.EPMDocument</type>
      <internal_id>34570129</internal_id>
      <name>d61-2446-02_nest_plate.prt</name>
      <number>D61-2446-02_NEST_PLATE.PRT</number>
      <cadname>d61-2446-02_nest_plate.prt</cadname>
      <version>-</version>
      <iteration>1</iteration>
      <isLatest>true</isLatest>
      <modifiedBy>
         <username>esb044c (deleted)</username>
         <email/>
      </modifiedBy>
      <content>
         <name>d61-2446-02_nest_plate.prt</name>
         <id>5344204</id>
         <uploaded>Jan 30, 2006 09:09:24</uploaded>
         <refSize>109278</refSize>
         <storage>
            <vault>master_vault</vault>
            <folder>master_vault7</folder>
            <filename>000000000518e8</filename>
            <location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
            <actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
         </storage>
         <replicatedTo>
         </replicatedTo>
         <copies>
         </copies>
         <status>Content File Missing</status>
      </content>
    </Document>
    <Document>
      <type>wt.epm.EPMDocument</type>
      <internal_id>33512036</internal_id>
      <name>d68-2568-07_press_head_adhesive.prt</name>
      <number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
      <cadname>d68-2568-07_press_head_adhesive.prt</cadname>
      <version>-</version>
      <iteration>2</iteration>
      <isLatest>true</isLatest>
      <modifiedBy>
         <username>e3789c (deleted)</username>
         <email/>
      </modifiedBy>
      <content>
         <name>d68-2568-07_press_head_adhesive.prt</name>
         <id>5340927</id>
         <uploaded>Jan 10, 2006 15:42:31</uploaded>
         <refSize>76314</refSize>
         <storage>
            <vault>master_vault</vault>
            <folder>master_vault7</folder>
            <filename>0000000005024b</filename>
            <location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
            <actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
         </storage>
         <replicatedTo>
         </replicatedTo>
         <copies>
         </copies>
         <status>Content File Missing</status>
      </content>
    </Document>
    <Document>
      <type>wt.epm.EPMDocument</type>
      <internal_id>34715717</internal_id>
      <name>dbk_flip_sleeve.prt</name>
      <number>DBK_FLIP_SLEEVE.PRT</number>
      <cadname>dbk_flip_sleeve.prt</cadname>
      <version>-</version>
      <iteration>1</iteration>
      <isLatest>false</isLatest>
      <modifiedBy>
         <username>EKA014 (deleted)</username>
         <email/>
      </modifiedBy>
      <content>
         <name>dbk_flip_sleeve.prt</name>
         <id>5344969</id>
         <uploaded>Feb 01, 2006 12:54:43</uploaded>
         <refSize>847210</refSize>
         <storage>
            <vault>master_vault</vault>
            <folder>master_vault7</folder>
            <filename>00000000051b54</filename>
            <location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
            <actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
         </storage>
         <replicatedTo>
         </replicatedTo>
         <copies>
         </copies>
         <status>Content File Missing</status>
      </content>
     </Document>
    

    Here is my updated code:

    import xml.etree.ElementTree as ET
    import csv
    
    parser = ET.iterparse("windchill.xml", events=('start', 'end'))
    
    csvfile = open('windchill.txt', 'w', encoding="utf-8")
    
    csvwriter = csv.writer(csvfile)
    
    for event, document in parser:
    
        if event=='end' and document.tag=='Document':
            if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
                row = []
                version = document.find('version').text
                row.append(version)
                name = document.find('content').find('name').text
                row.append(name)
                filename = document.find('content').find('storage').find('filename').text
                row.append(filename)
                # folder = document.find('content').find('storage').find('actualLocation').text
                folder = document.find('content').find('storage').find('folder').text
                row.append(folder)
                csvwriter.writerow(row)
    
    csvfile.close()
    

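    I added the type to the check. Type wt.epm.EPMDocument will have that record. Then I want to pull data from the storage element, specifically the name, folder, and filename. I originally used actualLocation rather than the vault folder, but changed it in the hope that the shorter names would help with my memory error.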

1 Answer:

Answer 0: (score: 0)

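Regarding your first question: iterparse 'sees' every XML element in the file, once when the element starts and again when it closes. This may explain the duplicates you are finding. You have to filter not only for the elements you want but also for the corresponding events. You can look at this answer, https://stackoverflow.com/a/46167799/131187, to see how to handle this.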
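Which events iterparse reports depends on its events argument: with no argument only 'end' events are delivered, and 'start' events show up only when they are requested. A small sketch of the difference, using a made-up two-document sample rather than the real windchill.xml; the helper name count_document_events is hypothetical:

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical miniature sample standing in for windchill.xml.
SAMPLE = (b"<Documents>"
          b"<Document><isLatest>true</isLatest></Document>"
          b"<Document><isLatest>false</isLatest></Document>"
          b"</Documents>")

def count_document_events(events=None):
    """Count how many times iterparse reports a Document element."""
    seen = 0
    for event, element in iterparse(io.BytesIO(SAMPLE), events=events):
        if element.tag == "Document":
            seen += 1
    return seen

default_count = count_document_events()               # 'end' events only
both_count = count_document_events(("start", "end"))  # each element reported twice
print(default_count, both_count)
```

When both events are requested, checking the tag alone processes each Document twice, and at the 'start' event its children have usually not been parsed yet.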

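Regarding the second: when document.find('isLatest') cannot find what you asked for, it returns None rather than an object representing an XML element. None has no attributes, text included, so your program fails at that point and the CSV file you get is incomplete.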
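One way to avoid the AttributeError is to ask for the text with findtext, which returns a default value instead of None when the child is missing. A minimal sketch, assuming the element layout shown in the question; the helper name is_latest is made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_latest(document):
    """Return True only if the Document has an <isLatest> child equal to 'true'."""
    # findtext returns the default (an empty string here) when no child matches,
    # so there is never a None to call .text on.
    return document.findtext("isLatest", default="").strip() == "true"

complete = ET.fromstring("<Document><isLatest>true</isLatest></Document>")
missing = ET.fromstring("<Document><name>x.prt</name></Document>")
print(is_latest(complete), is_latest(missing))
```

findtext also accepts simple paths such as 'content/storage/filename', which can replace the chained find calls.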

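Edit in response to a comment: this code parses the XML but does not write the CSV. The CSV record would be written in the save_csv_record function or its equivalent. It appears only once in the code, so it should be easy to replace.

Calling iterparse the way this code does returns only 'end' events and their corresponding XML elements. So the code watches for the 'end' of a 'Document'. When it sees one, it checks whether the document contains 'true' for 'isLatest'. If it does, the record is written out; if not, it is ignored. In both cases document_content is then emptied. If the code has not yet seen the 'end' of a Document, it simply saves the content of the tag and keeps reading.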

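from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    # Placeholder: print the record instead of writing a CSV row.
    print(record)
    return

document_content = {}
# With no events argument, iterparse reports only 'end' events.
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        # End of a Document: keep it only if it was flagged as latest.
        # .get avoids a KeyError when a Document has no isLatest element.
        if document_content.get('isLatest') == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        # Remember the text of every other element, keyed by tag name.
        document_content[el.tag] = el.text.strip() if el.text else None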

Output:

{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
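The save_csv_record placeholder could be swapped for something that actually writes a row. A sketch, assuming the three columns from the question (name, filename, actualLocation); make_csv_saver is a hypothetical helper, not part of the answer. Because the flat dict keeps one value per tag, 'name' holds whichever name element closed last, which is the one inside content.

```python
import csv
import io

FIELDS = ("name", "filename", "actualLocation")  # assumed column choice

def make_csv_saver(csvfile):
    """Build a save_csv_record function that writes rows to an open file."""
    writer = csv.writer(csvfile)

    def save_csv_record(record):
        # Missing or empty tags become empty cells instead of raising KeyError.
        writer.writerow([record.get(field) or "" for field in FIELDS])

    return save_csv_record

# Demonstration with an in-memory file; a real run would use something like
# open('windchill.txt', 'w', encoding='utf-8', newline='') instead.
buffer = io.StringIO(newline="")
save_csv_record = make_csv_saver(buffer)
save_csv_record({"name": "d61-2446-02_nest_plate.prt",
                 "filename": "000000000518e8",
                 "actualLocation": "/vault/Windchill_Vaults/WcVlt7"})
csv_text = buffer.getvalue()
print(csv_text)
```

Opening the output file with newline='' is what the csv module documentation recommends, so that the writer controls the line endings itself.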

Edit, latest code:

Here is the new code I am using, which still runs out of memory:

from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        if document_content['type']=='wt.epm.EPMDocument' and document_content['isLatest'] == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
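
A possible reason memory keeps growing: iterparse still builds the element tree as it reads, so every parsed element stays attached to the root unless it is released. A sketch of one common way to bound this, by taking the root from the first 'start' event and clearing it after each Document. It assumes each Document sits directly under a single root element, which the excerpts do not show; iter_latest_rows and the sample data are made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

def iter_latest_rows(source):
    """Yield (name, filename, folder) for each latest EPMDocument in source."""
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, element in context:
        if event != "end" or element.tag != "Document":
            continue
        if (element.findtext("type") == "wt.epm.EPMDocument"
                and element.findtext("isLatest") == "true"):
            yield (element.findtext("content/name"),
                   element.findtext("content/storage/filename"),
                   element.findtext("content/storage/folder"))
        # Detach the finished Document from the root so it can be freed.
        root.clear()

# Hypothetical two-document sample; the wrapper tag name is an assumption.
SAMPLE = b"""<Documents>
<Document><type>wt.epm.EPMDocument</type><isLatest>false</isLatest>
<content><name>old.prt</name><storage><folder>master_vault7</folder>
<filename>000000000505a6</filename></storage></content></Document>
<Document><type>wt.epm.EPMDocument</type><isLatest>true</isLatest>
<content><name>d61-2446-02_nest_plate.prt</name><storage><folder>master_vault7</folder>
<filename>000000000518e8</filename></storage></content></Document>
</Documents>"""

rows = list(iter_latest_rows(io.BytesIO(SAMPLE)))
print(rows)
```

For the real file, the generator could be fed open('windchill.xml', 'rb') and each yielded tuple passed to csv.writer.writerow.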