Parse error when converting XML to JSON in Python with nested tags of the same name

Asked: 2014-11-09 17:20:38

Tags: python xml json mongodb gzip

I have a small setup with one large XML file in the following format:

<doc id="1">Some text</doc>
<doc id="2">more text</doc>

I use the following Python script to convert it to JSON:

from sys import stdout

import xmltodict
import gzip
import json

count = 0
xmlSrc = 'text.xml.gz'
jsDest = 'js/cd.js'

def parseNode(_, node):
    global count
    count += 1
    stdout.write("\r%d" % count)

    jsonNode = json.dumps(node)
    f.write(jsonNode + '\n')
    return True

f = open(jsDest, 'w')

xmltodict.parse(gzip.open(xmlSrc), item_depth=2, item_callback=parseNode)

f.close()

stdout.write("\n") # move the cursor to the next line

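Is it possible to detect the closing </doc> tag, break there, and then continue converting? I have looked at other Stack Overflow questions, such as "How do you parse nested XML tags with python?", but they did not help.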

1 Answer:

Answer 0 (score: 0)

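Since your <doc> tags are not themselves nested, you can iterate over the document, build each object by hand, and dump the result to JSON. Here is an example: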

import xml.etree.ElementTree as ET
import json

s= '''
<root>
    <abc>
        <doc id="12" url="example.com" title="Anarchism"> Anarchism .... </doc>
    </abc>
    <doc id="123" url="example2" title="Laptop"> Laptop .... </doc>
    <def>
        <doc id="3">Final text</doc>
    </def>
</root>
'''

tree = ET.fromstring(s)
j = []
# iterfind walks the element tree and yields every <doc> element, at any depth
for node in tree.iterfind('.//doc'):
    # manually build the dict with doc attribute and text
    attrib = {}
    attrib.update(node.attrib)
    attrib.update({'text': node.text})
    d = {'doc': [ attrib ] }
    j.append(d)

print(json.dumps(j))
# Example output (key order depends on the Python version):
# [{"doc": [{"url": "example.com", "text": " Anarchism .... ", "id": "12", "title": "Anarchism"}]}, {"doc": [{"url": "example2", "text": " Laptop .... ", "id": "123", "title": "Laptop"}]}, {"doc": [{"text": "Final text", "id": "3"}]}]

# to write to a json file
with open('yourjsonfile', 'w') as f:
    f.write(json.dumps(j))
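
The example above loads the whole document into memory with ET.fromstring, which may not suit the large gzipped file described in the question. A minimal streaming sketch follows, assuming the file has a single root element wrapping the <doc> tags (the snippet in the question does not show one). The function name docs_to_json_lines and the sample data are made up for illustration; only standard-library calls are used.

```python
import gzip
import io
import json
import xml.etree.ElementTree as ET


def docs_to_json_lines(xml_stream, out_stream):
    """Write one JSON object per <doc> element; return how many were written.

    Hypothetical helper, not part of the original answer.
    """
    count = 0
    # With events=("end",), iterparse yields each element once its closing
    # tag has been read, so a <doc> is complete when we see it.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "doc":
            continue
        record = dict(elem.attrib)   # copy the attributes (id, url, title, ...)
        record["text"] = elem.text   # text before the first child, if any
        out_stream.write(json.dumps(record) + "\n")
        count += 1
        elem.clear()                 # discard the element's contents to save memory
    return count


# Small in-memory demonstration with made-up data, gzip-compressed so the
# same gzip.open call can be used as for a file on disk.
sample = (
    b'<root><abc><doc id="1">Some text</doc></abc>'
    b'<doc id="2">more text</doc></root>'
)
demo_out = io.StringIO()
with gzip.open(io.BytesIO(gzip.compress(sample))) as src:
    demo_count = docs_to_json_lines(src, demo_out)

print(demo_out.getvalue(), end="")
```

For the files named in the question, the same function could be called as docs_to_json_lines(src, dst) inside `with gzip.open('text.xml.gz') as src, open('js/cd.js', 'w') as dst:`. Writing one object per line keeps the output format of the original script.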