Cannot pass an lxml etree object to a separate process

Asked: 2014-09-23 09:44:24

Tags: python lxml python-multiprocessing

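I am working on a project that uses lxml in Python to parse several XML files concurrently. When I initialize a process, I want my main class to do some work on the XML before handing the etree object to the process. However, when the etree object arrives in the new process, the object itself is still alive but the XML has been removed from it: getroot() returns None.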

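I know that only picklable data can be passed through a queue, but does the same restriction apply to what I pass to a process through the 'args' field?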

Here is my code:

import multiprocessing, multiprocessing.pool, time
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(id(tree)) # Returns new ID 44637320 as expected
    print(tree.getroot()) # Returns None

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(id(tree)) # Returns object ID 43998536
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

Any and all help is greatly appreciated.

UPDATE / ANSWER

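Based on the answer below, I modified the approach slightly and got it working with a lower memory footprint by not using StringIO for serialization. The etree.tostring method returns a bytes object, which can be pickled; after unpickling, etree can parse those bytes again.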

import multiprocessing, multiprocessing.pool, time, copyreg
from io import BytesIO
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>. Success!

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

def elementtree_unpickler(data):
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    return elementtree_unpickler, (etree.tostring(tree),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()

UPDATE 2

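After benchmarking memory usage, I found that passing large objects prevents them from being cleared by garbage collection in the main process. This may not matter at small scale, but my etree objects are on the order of hundreds of MB in memory. Once an async task has been called with the XML object as an argument, the object cannot be cleared from memory, even if I delete it in the main process and invoke garbage collection manually. I have therefore reverted to closing the XML in the main process and passing the file name to the child process.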
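A minimal sketch of the file-name approach described above, for illustration only. The names summarize and summarize_files are hypothetical, and the fallback to the standard-library parser is an assumption made so the sketch runs even where lxml is not installed. Only the path travels to the worker, and only small, picklable data travels back.

```python
import multiprocessing

try:
    from lxml import etree
except ImportError:
    # Assumption for this sketch: fall back to the stdlib parser, which
    # offers the same parse()/getroot() calls used below.
    import xml.etree.ElementTree as etree


def summarize(path):
    """Worker: parse the file inside the child process.

    Returns the root tag and the number of direct children, both of
    which are plain picklable values.
    """
    tree = etree.parse(path)
    root = tree.getroot()
    return root.tag, len(root)


def summarize_files(paths, processes=1):
    """Send only file names to the pool; each worker parses its own file."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(summarize, paths)
```

Calling summarize_files(['test.xml']) from under the `if __name__ == '__main__':` guard would then hand back one (tag, child count) pair per file, and the main process never holds a parsed tree.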

1 answer:

Answer 0 (score: 5)

Use the following code to register a simple pickler/unpickler for lxml Element / ElementTree objects. I have used this in the past with lxml and zmq.

try:
    import copyreg  # Python 3
except ImportError:
    import copy_reg as copyreg  # Python 2
from io import BytesIO  # lxml serializes to bytes, so use a bytes buffer
from lxml import etree

def element_unpickler(data):
    # Rebuild the element from its serialized XML.
    return etree.fromstring(data)

def element_pickler(element):
    # Serialize the element to XML so that pickle only has to carry bytes.
    data = etree.tostring(element)
    return element_unpickler, (data,)

copyreg.pickle(etree._Element, element_pickler, element_unpickler)

def elementtree_unpickler(data):
    # Parse the serialized document back into an ElementTree.
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    # Write the whole document into an in-memory buffer.
    data = BytesIO()
    tree.write(data)
    return elementtree_unpickler, (data.getvalue(),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)
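Since multiprocessing.Pool pickles task arguments before sending them to a worker, one way to check that a registered reducer takes effect is a plain pickle round trip in the parent process. The following is a sketch for Python 3 only. ELEMENT_TYPE is a name introduced here for illustration, and the fallback to the standard-library parser is an assumption so the check can run without lxml.

```python
import copyreg
import pickle

try:
    from lxml import etree
    ELEMENT_TYPE = etree._Element
except ImportError:
    # Assumption for this sketch: use the stdlib parser when lxml is absent.
    import xml.etree.ElementTree as etree
    ELEMENT_TYPE = etree.Element


def element_unpickler(data):
    # Rebuild the element from its serialized XML bytes.
    return etree.fromstring(data)


def element_pickler(element):
    # Tell pickle to store the XML bytes plus the function that restores them.
    return element_unpickler, (etree.tostring(element),)


copyreg.pickle(ELEMENT_TYPE, element_pickler, element_unpickler)

# Round trip: roughly what happens to an argument on its way to a worker.
original = etree.fromstring(b"<root><child>text</child></root>")
restored = pickle.loads(pickle.dumps(original))
print(restored.tag, restored[0].text)
```

If the restored element still carries its tag and children, the same object should survive being passed as an argument to a pool worker, provided the registration also runs in the worker (for example because it sits at module level).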