I have a small setup: I have a large XML file in the following format:
<doc id="1">Some text</doc>
<doc id="2">more text</doc>
I use the following Python script to convert it to JSON:
from sys import stdout
import xmltodict
import gzip
import json

count = 0
xmlSrc = 'text.xml.gz'
jsDest = 'js/cd.js'

def parseNode(_, node):
    global count
    count += 1
    stdout.write("\r%d" % count)
    jsonNode = json.dumps(node)
    f.write(jsonNode + '\n')
    return True

f = open(jsDest, 'w')
xmltodict.parse(gzip.open(xmlSrc), item_depth=2, item_callback=parseNode)
f.close()
stdout.write("\n")  # move the cursor to the next line
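Is it possible to detect the closing </doc> tag, break there, and then continue converting? I have looked at other Stack Overflow questions, but they did not help: How do you parse nested XML tags with python?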
Answer 0 (score: 0)
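Since your <doc> tags are not nested inside one another, you can iterate over the document, build each object by hand, and dump the result to JSON. Here is an example: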
import xml.etree.ElementTree as ET
import json
s= '''
<root>
<abc>
<doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc>
</abc>
<doc id="123" url="example2" title="Laptop"> Laptop .... </doc>
<def>
<doc id="3">Final text</doc>
</def>
</root>
'''
tree = ET.fromstring(s)
j = []
# iterfind walks the element tree and yields every doc element
for node in tree.iterfind('.//doc'):
    # manually build the dict from the doc attributes and text
    attrib = {}
    attrib.update(node.attrib)
    attrib.update({'text': node.text})
    d = {'doc': [attrib]}
    j.append(d)
json.dumps(j)
'[{"doc": [{"url": "example.com", "text": " Anarchism .... ", "id": "12", "title": "Anarchism"}]}, {"doc": [{"url": "example2", "text": " Laptop .... ", "id": "123", "title": "Laptop"}]}, {"doc": [{"text": "Final text", "id": "3"}]}]'
# to write to a json file
with open('yourjsonfile', 'w') as f:
    f.write(json.dumps(j))
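The example above parses the whole string in memory, which may not suit the large file in the question. As a rough sketch of a streaming variant (an assumption on my part, not something tested against the real data), `xml.etree.ElementTree.iterparse` reports each element once its closing tag has been read, so every `<doc>` can be written out as soon as it ends. The helper name `docs_to_json_lines` and the sample bytes are made up for illustration.

```python
import io
import json
import xml.etree.ElementTree as ET


def docs_to_json_lines(xml_stream, out_stream):
    """Write one JSON object per <doc> element and return how many were written."""
    count = 0
    # With events=("end",), iterparse yields an element when its closing tag is read.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "doc":
            continue
        # Build the record from the doc attributes plus its text.
        record = dict(elem.attrib)
        record["text"] = elem.text
        out_stream.write(json.dumps(record) + "\n")
        count += 1
        # Drop the element's text and attributes; the emptied element itself
        # stays attached to its parent, so memory still grows slowly.
        elem.clear()
    return count


# Small in-memory demonstration with hypothetical sample data.
sample = b'<root><doc id="1">Some text</doc><abc><doc id="2">more text</doc></abc></root>'
out = io.StringIO()
n = docs_to_json_lines(io.BytesIO(sample), out)
print(n)
print(out.getvalue())
```

For the file in the question, the same function could presumably be called with `gzip.open('text.xml.gz', 'rb')` as the input stream and `open('js/cd.js', 'w')` as the output stream, which keeps the one-JSON-object-per-line layout of the original script.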