Parsing a large number of XML files to JSON

Date: 2017-06-29 06:30:16

Tags: python json xml parsing

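I'm working on a project that requires me to parse a large number of XML files to JSON. I've written code for it, but it is too slow. I've looked at using lxml and BeautifulSoup, but I'm not sure how to proceed.

I've included my code below. It works exactly as intended, except that it is far too slow: it took about 24 hours to get through a single file of under 100 MB and parse 100,000 records.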

product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()


def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''

    #Iterating through the string to find keys and values to put in to
    #single_record_dict.
    while record_string != record_string[::-1]:

        try:
            k = record_string.index('<')

            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l+1:]
            m = record_string.index('<')
            temp_value = record_string[:m]

            #Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]

            n = record_string.index('\n')
            record_string = record_string[n+2:]

            #Checking parent_rss dict to see if the key from the record is present. If it is,
            #the key is replaced with keys and added to single_record_dictionary.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value

        except Exception:
            break


while len(read_product_data) > 10:

    #Goes through read_product_data to create blocks, each of which is a single
    #record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]

    #Runs previous function with the input being the single string found previously.
    record_string_to_dict(single_record_string)

    #Flushes single_record_dict to final_list, and empties the dict for the next
    #record.
    final_list.append(single_record_dict)
    single_record_dict = {}

    #Removes the record that was previously processed.
    read_product_data = read_product_data[j:]

    #For keeping track/ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')

    #Keeps track of the number of records. Once the set value is reached
    #in the if loop, it is flushed to a new file.
    break_counter += 1
    flush_counter += 1

    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_'+str(file_counter)+'.txt', 'w')
        record_list.write(str(final_list))

        #file_counter keeps track of how many files have been created, so the next
        #file has a different int at the end.
        file_counter += 1
        record_list.close()

        #resets break counter
        break_counter = 0
        final_list = []
    #For testing purposes. Causes execution to stop once the number of files written
    #matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')

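2 Answers:

Answer 0 (score: 2)

Is there any reason you are not considering packages such as xml2json and xml2dict? For working examples, see this post: How can i convert an xml file into JSON using python?

Relevant code reproduced from the above post:

xml2json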

import xml2json
s = '''<?xml version="1.0"?>
    <note>
       <to>Tove</to>
       <from>Jani</from>
       <heading>Reminder</heading>
       <body>Don't forget me this weekend!</body>
    </note>'''
print xml2json.xml2json(s)

xmltodict

import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'

If you are working in Python 3, see this post: https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/

import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:    # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
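
If installing a third-party package is not an option, roughly the same conversion can be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch and not how xmltodict is implemented: the helper names element_to_dict and xml_string_to_json are made up for illustration, attributes and mixed content are ignored, and repeated child tags are collected into a list.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an Element into plain dicts, lists and strings."""
    children = list(elem)
    if not children:
        # Leaf node: keep only its text (attributes are ignored in this sketch).
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: collect the values in a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def xml_string_to_json(xml_text):
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)})
```

Calling xml_string_to_json on the small <e> document from the xmltodict example should give an equivalent structure, with the two <a> values gathered into one list.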

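Answer 1 (score: 0)

You definitely don't want to be parsing the XML by hand. As well as the libraries mentioned by others, you could use an XSLT 3.0 processor. To go beyond 100 MB you would benefit from a streaming processor such as Saxon-EE, but at that size the open-source Saxon-HE should be able to handle it. You haven't shown the source XML or the target JSON, so I can't give you specific code. The assumption in XSLT 3.0 is that you will probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how the different parts of your input XML should be handled.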
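
For readers who want to stay in Python, the streaming idea can be approximated with xml.etree.ElementTree.iterparse. To be clear, this is not XSLT, just a rough analogue: each <record> element is handled as soon as its closing tag is read, and a mapping plays the role of the per-node rules. The <record> tag comes from the question; the MAPPED_NODES contents and the function names are assumptions, since the real mapping and source XML were never shown.

```python
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tag names to output keys; the question's
# real mapped_nodes dict was not shown.
MAPPED_NODES = {"name": "title", "price": "cost"}


def iter_records(source, record_tag="record", mapping=MAPPED_NODES):
    """Yield one dict per <record>, keeping only the mapped child tags."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        yield {
            mapping[child.tag]: (child.text or "").strip()
            for child in elem
            if child.tag in mapping
        }
        # Drop the record's children so processed records do not keep
        # their content in memory.
        elem.clear()


def write_json_lines(source, out):
    """Write each record as one JSON object per line; return the count."""
    count = 0
    for record in iter_records(source):
        out.write(json.dumps(record) + "\n")
        count += 1
    return count
```

A call such as write_json_lines(open('productdata_29.xml', 'rb'), open('records.jsonl', 'w')) would then process the file in a single pass (records.jsonl is a made-up output name). Each record is visited once, so the cost grows linearly with file size instead of re-slicing the remaining string for every record.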