Question

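I am parsing a bunch of large XML files into a sqlite3 database in Python. As far as I can tell (though I am open to, and looking for, faster options), the most performant option is sqlite3's executemany() insert function.

What I am currently doing is roughly as follows: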

from collections import namedtuple

document_dir = '/documents'

Document = namedtuple('Document', 'doc_id doc_title doc_mentioned_people ... etc')
People = namedtuple('People', 'doc_id first_name last_name ... ')

class DocumentXML(object):
    """
    ... there's some stuff here, but you get the idea

    """

    def parse_document(path):
        """
        This object keeps track of the current 'document' type element from a cElementTree.iterparse() elsewhere

        I've simplified things here, but you can get the idea that this is providing a named tuple for a generator
        """
        doc_id = _current_element.findall('../id')[0].text
        doc_title = _current_element.findall('../title')[0].text

        # parse lists of people here

        doc_mentioned_people = People(first_name, last_name, ..., person_id)
        #etc...
        return Document(doc_id, doc_title, doc_mentioned_people, ..., etc)

def doc_generator():
    documents = parse_document(document_dir)
    for doc in documents:
        yield doc.doc_id, doc.doc_title, ..., doc.date



# Import into Table 1
with cursor(True) as c:
        c.executemany("INSERT INTO Document VALUES (?,?,?,?,?,?,?,?,?,?,?);", doc_generator())



def people_generator():
    documents = parse_document(document_dir)
    for doc in documents:
        people = doc.doc_mentioned_people
        yield people.first_name, people.last_name, ..., people.eyecolor


# Import into Table 2
with cursor(True) as c:
        c.executemany("INSERT INTO People VALUES (?,?,?,?,?,?,?);", people_generator())


# This goes on for several tables...

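As you can see, this is terribly inefficient. Each XML file is parsed over and over again, once for every table in the database.

I would like to parse each XML file only once (since I can produce all of the relevant information in one named tuple), but keep the structure as a generator so that memory requirements do not blow up to unfeasible levels.

Is there a good way to do this?

My attempts have revolved around using executemany with a multi-insert type of statement, such as: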

c.executemany("""
    INSERT INTO Document VALUES (?,?,?,?,?,?,?,?,?,?,?);
    INSERT INTO People VALUES (?,?,?,?,?,?,?);
    INSERT INTO Companies VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?);
    INSERT INTO Oils VALUES (?,?,?,?,?,?,?);
    INSERT INTO Physics VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
        complete_data_generator())

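where complete_data_generator() yields all of the relevant structured information; however, I know this is unlikely to work.

Is there a better way to structure this for performance?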

Answer 1

If you have only a few small documents, load everything into memory and stop worrying about re-parsing the documents.
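As a rough illustration of the in-memory option, here is a minimal sketch. It assumes a hypothetical parse_small_documents() generator standing in for the single pass over the XML files, and made-up two- and three-column tables rather than the real schema from the question.

```python
import sqlite3

def parse_small_documents():
    """Hypothetical stand-in for one pass over the XML files.

    Yields (document_row, people_rows) pairs with made-up columns.
    """
    for i in range(3):
        yield (i, "title %d" % i), [(i, "first%d" % i, "last%d" % i)]

# Parse once and keep everything in memory (only sensible for small inputs).
parsed = list(parse_small_documents())
document_rows = [doc for doc, _ in parsed]
person_rows = [person for _, people in parsed for person in people]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Document (doc_id INTEGER, doc_title TEXT)")
db.execute("CREATE TABLE People (doc_id INTEGER, first_name TEXT, last_name TEXT)")

# One executemany per table, all inside a single transaction.
with db:  # commits on success, rolls back on error
    db.executemany("INSERT INTO Document VALUES (?,?)", document_rows)
    db.executemany("INSERT INTO People VALUES (?,?,?)", person_rows)

n_documents = db.execute("SELECT COUNT(*) FROM Document").fetchone()[0]
n_people = db.execute("SELECT COUNT(*) FROM People").fetchone()[0]
```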

If you had only one table to feed, the generator approach would be fine.

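If neither of these fits, I would try a middle-ground approach:

- parse a bunch of XML files and accumulate a number of doc elements
- once a reasonable number of documents is available, pause parsing and feed the database tables using executemany on that batch of documents
- after inserting that batch of documents, optionally commit to release the SQLite journal file, then resume parsing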
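The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_documents(), batched() and load() are hypothetical helpers, the tables have made-up columns, and the real parser would wrap the iterparse() loop from the question.

```python
import sqlite3
from itertools import islice

def batched(iterable, size):
    """Lazily yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def parse_documents():
    """Hypothetical stand-in for the single pass over the XML files.

    Yields (document_row, people_rows) pairs with made-up columns.
    """
    for i in range(10):
        yield (i, "title %d" % i), [(i, "first%d" % i, "last%d" % i)]

def load(conn, parsed, batch_size=4):
    """Feed every table from one batch at a time, committing after each batch."""
    for batch in batched(parsed, batch_size):
        doc_rows = [doc for doc, _ in batch]
        people_rows = [p for _, people in batch for p in people]
        conn.executemany("INSERT INTO Document VALUES (?,?)", doc_rows)
        conn.executemany("INSERT INTO People VALUES (?,?,?)", people_rows)
        conn.commit()  # optional intermediate commit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (doc_id INTEGER, doc_title TEXT)")
conn.execute("CREATE TABLE People (doc_id INTEGER, first_name TEXT, last_name TEXT)")
load(conn, parse_documents())

doc_count = conn.execute("SELECT COUNT(*) FROM Document").fetchone()[0]
people_count = conn.execute("SELECT COUNT(*) FROM People").fetchone()[0]
```

Only one batch is held in memory at a time, so batch_size is the knob that trades memory use against the number of executemany calls and commits.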

Pros:

- files are parsed only once
- the load on the SQLite database can be controlled through intermediate commits
- you still use executemany

Cons:

- many calls to executemany, depending on the amount of data
- each commit takes some time

Efficient design of generators for SQLite3 executemany() inserts on multiple tables
