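Hi folks, I'm new to Python (brand new), so after taking an elective course I decided to write a script that converts an XML file to CSV. The file in question is 2 GB, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse function. For reference, the XML file I want to convert looks like this: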
<Document>
  <type></type>
  <internal_id></internal_id>
  <name></name>
  <number></number>
  <cadname></cadname>
  <version></version>
  <iteration></iteration>
  <isLatest></isLatest>
  <modifiedBy>
    <username></username>
    <email/>
  </modifiedBy>
  <content>
    <name></name>
    <id></id>
    <uploaded></uploaded>
    <refSize></refSize>
    <storage>
      <vault></vault>
      <folder></folder>
      <filename></filename>
      <location></location>
      <actualLocation></actualLocation>
    </storage>
    <replicatedTo></replicatedTo>
    <copies></copies>
    <status></status>
  </content>
</Document>
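I use the value of isLatest to decide whether an item needs to go into the CSV file. If the value is "true", I want the data moved to the CSV file. Here is the code, which works up to a point: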
import xml.etree.ElementTree as ET
import csv
parser = ET.iterparse("windchill.xml")
# open a file for writing
csvfile = open('windchill.txt', 'w', encoding="utf-8")
# create the csv writer object
csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:
    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
        document.clear()
csvfile.close()
If I run the code, I get this error:
Traceback (most recent call last):
  File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
    if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'
It creates a file with 91,000 entries that look like this:
plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7
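There seem to be two problems with the output.
So, any ideas what the error is telling me, and why I might be seeing duplicates? Initially I thought the error might be related to the script not ending gracefully. I believe the whole XML is well formed, but maybe that is a bad assumption.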
****** UPDATES ******
Here is a sample of the elements.
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33709881</internal_id>
<name>bga_13x11p137_0_4_0_8.prt</name>
<number>BGA_13X11P137_0_4_0_8.PRT</number>
<cadname>bga_13x11p137_0_4_0_8.prt</cadname>
<version>A</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>ets027 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>bga_13x11p137_0_4_0_8.prt</name>
<id>5341368</id>
<uploaded>Jan 13, 2006 09:14:41</uploaded>
<refSize>287764</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000505a6</filename>
<location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34570129</internal_id>
<name>d61-2446-02_nest_plate.prt</name>
<number>D61-2446-02_NEST_PLATE.PRT</number>
<cadname>d61-2446-02_nest_plate.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>esb044c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d61-2446-02_nest_plate.prt</name>
<id>5344204</id>
<uploaded>Jan 30, 2006 09:09:24</uploaded>
<refSize>109278</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000518e8</filename>
<location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33512036</internal_id>
<name>d68-2568-07_press_head_adhesive.prt</name>
<number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
<cadname>d68-2568-07_press_head_adhesive.prt</cadname>
<version>-</version>
<iteration>2</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>e3789c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d68-2568-07_press_head_adhesive.prt</name>
<id>5340927</id>
<uploaded>Jan 10, 2006 15:42:31</uploaded>
<refSize>76314</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>0000000005024b</filename>
<location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34715717</internal_id>
<name>dbk_flip_sleeve.prt</name>
<number>DBK_FLIP_SLEEVE.PRT</number>
<cadname>dbk_flip_sleeve.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>EKA014 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>dbk_flip_sleeve.prt</name>
<id>5344969</id>
<uploaded>Feb 01, 2006 12:54:43</uploaded>
<refSize>847210</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>00000000051b54</filename>
<location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
Here is my updated code:
import xml.etree.ElementTree as ET
import csv
parser = ET.iterparse("windchill.xml", events=('start', 'end'))
csvfile = open('windchill.txt', 'w', encoding="utf-8")
csvwriter = csv.writer(csvfile)
for event, document in parser:
    if event == 'end' and document.tag == 'Document':
        if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
            row = []
            version = document.find('version').text
            row.append(version)
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            # folder = document.find('content').find('storage').find('actualLocation').text
            folder = document.find('content').find('storage').find('folder').text
            row.append(folder)
            csvwriter.writerow(row)
csvfile.close()
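I added the type to the check. Documents of type wt.epm.EPMDocument are the ones that will have that record. I then want to pull data from the storage element, specifically the name, folder, and filename. I was originally using actualLocation instead of the vault folder, but changed it in the hope that the shorter name would help with my memory error.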
Answer 0 (score: 0)
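Regarding your first question: iterparse can 'see' every XML element in the file twice, once when the element starts and again when it closes. If both events reach your loop, that may explain the duplicates you found. You have to filter not only for the element you want but also for the corresponding event. See this answer, https://stackoverflow.com/a/46167799/131187, for one way to handle this.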
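To make the point about events concrete, here is a small sketch that lists what iterparse reports for a tiny in-memory document. The SAMPLE bytes, the Documents wrapper element, and the collect_events helper are assumptions made up for illustration; they are not taken from the real file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the <Documents> wrapper is an assumption.
SAMPLE = b"<Documents><Document><isLatest>true</isLatest></Document></Documents>"

def collect_events(xml_bytes, events=None):
    """Return the (event, tag) pairs that iterparse reports for xml_bytes."""
    # events=None is iterparse's default and reports only "end" events.
    parser = ET.iterparse(io.BytesIO(xml_bytes), events=events)
    return [(event, elem.tag) for event, elem in parser]

default_events = collect_events(SAMPLE)
both_events = collect_events(SAMPLE, events=("start", "end"))

print(default_events)  # one entry per element
print(both_events)     # two entries per element
```

With the default arguments each element shows up once, at its 'end'. With events=("start", "end") each element shows up twice, and at the 'start' event its children are not guaranteed to have been parsed yet, so a lookup such as find('isLatest') can come back empty there.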
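Regarding the second: when document.find('isLatest') cannot find what you asked for, it returns None rather than an object representing an XML element. None has no attributes, including text, so your program fails at that point and the CSV file you get is incomplete.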
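One way to avoid the None.text failure is to ask for the text directly with findtext and supply a default. This is only a sketch: the is_latest helper and the two sample elements are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

def is_latest(document):
    """Return True only when an <isLatest> child exists and its text is 'true'."""
    # findtext returns the default (an empty string here) when the child is
    # absent, so there is no attribute access on None.
    return document.findtext("isLatest", default="").strip() == "true"

complete = ET.fromstring("<Document><isLatest>true</isLatest></Document>")
missing = ET.fromstring("<Document><name>x.prt</name></Document>")

print(is_latest(complete))
print(is_latest(missing))
```

A Document without an isLatest child is simply treated as not latest instead of stopping the run.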
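Edit in response to a comment: the code below parses the XML but does not write the CSV. The CSV records would be written in the save_csv_record function or its equivalent. It appears only once in the code, so it should be easy to replace.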
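Calling iterparse the way this code does returns only 'end' events and their corresponding XML elements. The code therefore watches for the 'end' of a 'Document'. When it sees one, it checks whether the document has 'true' for 'isLatest'. If so, it writes the record out; if not, it ignores it. Either way it then empties document_content. If the element is not the 'end' of a Document, the code just saves the tag's content and keeps reading.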
from xml.etree.ElementTree import iterparse
def save_csv_record(record):
    print(record)
    return
document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        if document_content['isLatest'] == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
Output:
{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
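As a possible replacement for the print-only save_csv_record, here is a minimal sketch built on csv.DictWriter. The FIELDS list and the make_csv_saver helper are assumptions; the columns are simply the ones the question's updated code extracts.

```python
import csv
import io

# Hypothetical column choice, based on the fields the question extracts.
FIELDS = ["version", "name", "filename", "folder"]

def make_csv_saver(csvfile, fields=FIELDS):
    """Return a save_csv_record(record) function that writes rows to csvfile."""
    # extrasaction="ignore" drops dictionary keys that are not in fields.
    writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction="ignore")

    def save_csv_record(record):
        writer.writerow(record)

    return save_csv_record

# Demonstration with an in-memory file and values from the first record above.
buffer = io.StringIO()
save_csv_record = make_csv_saver(buffer)
save_csv_record({
    "version": "-",
    "name": "d61-2446-02_nest_plate.prt",
    "filename": "000000000518e8",
    "folder": "master_vault7",
    "status": "Content File Missing",  # not in FIELDS, so it is ignored
})
csv_text = buffer.getvalue()
print(csv_text)
```

For a real run the in-memory buffer would be replaced by a file opened with newline="" as the csv module recommends. Note that the flat dictionary keeps only the last value seen for a repeated tag, so 'name' ends up holding the name from the content element.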
Edit, latest code:
Here is the new code I'm using, which still runs out of memory:
from xml.etree.ElementTree import iterparse
def save_csv_record(record):
    print(record)
    return
document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        if (document_content['type'] == 'wt.epm.EPMDocument' and
                document_content['isLatest'] == 'true'):
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
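A likely reason the memory keeps growing is that iterparse still builds the whole tree in the background unless finished elements are discarded. The sketch below shows one common way to do that: keep a reference to the root element and clear it after each Document. It assumes the Document elements sit directly under a single wrapper root (called Documents in the sample, which is a guess), and the stream_latest name is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def stream_latest(source, csvfile):
    """Write one CSV row per latest EPMDocument; return the number of rows."""
    writer = csv.writer(csvfile)
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    rows = 0
    for event, elem in context:
        if event != "end" or elem.tag != "Document":
            continue
        if (elem.findtext("type") == "wt.epm.EPMDocument"
                and elem.findtext("isLatest") == "true"):
            writer.writerow([
                elem.findtext("version"),
                elem.findtext("content/name"),
                elem.findtext("content/storage/filename"),
                elem.findtext("content/storage/folder"),
            ])
            rows += 1
        root.clear()  # drop finished <Document> elements so the tree stays small
    return rows

# Small in-memory sample; the <Documents> wrapper is an assumption.
SAMPLE = b"""<Documents>
<Document><type>wt.epm.EPMDocument</type><version>A</version><isLatest>false</isLatest>
<content><name>bga_13x11p137_0_4_0_8.prt</name><storage><folder>master_vault7</folder>
<filename>000000000505a6</filename></storage></content></Document>
<Document><type>wt.epm.EPMDocument</type><version>-</version><isLatest>true</isLatest>
<content><name>d61-2446-02_nest_plate.prt</name><storage><folder>master_vault7</folder>
<filename>000000000518e8</filename></storage></content></Document>
</Documents>"""

out = io.StringIO()
row_count = stream_latest(io.BytesIO(SAMPLE), out)
print(row_count, out.getvalue())
```

For the real file, source would be "windchill.xml" and csvfile an output file opened with newline="" and encoding="utf-8". Whether this is enough for the 2 GB input has not been tested here.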