Converting docx to html raises a Python MemoryError

Time: 2019-04-16 12:22:23

Tags: python html windows openxml docx

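I have a function that converts docx to html, and a large docx file to convert. The function is part of a larger program, and the converted html is parsed afterwards, so I can't switch to a different converter without affecting the rest of the code (which I don't want). It runs on a 32-bit install of Python 2.7.13, and I would prefer not to move to 64-bit either. This is the function: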

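import logging

import ooxml  # needed for ooxml.read_from_file below
from ooxml import serialize


def trasnformDocxtoHtml(inputFile, outputFile):
    # Send log messages to a file.
    logging.basicConfig(filename='ooxml.log', level=logging.INFO)
    # Parse the whole .docx; the MemoryError is raised inside this call.
    dfile = ooxml.read_from_file(inputFile)

    # Serialize the parsed document to HTML and write it out.
    with open(outputFile, 'w') as htmlFile:
        htmlFile.write(serialize.serialize(dfile.document))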

This is the error:

>>> import library
>>> library.trasnformDocxtoHtml(r'large_file.docx', 'output.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "library.py", line 9, in trasnformDocxtoHtml
    dfile = ooxml.read_from_file(inputFile)
  File "C:\Python27\lib\site-packages\ooxml\__init__.py", line 52, in read_from_file
    dfile.parse()
  File "C:\Python27\lib\site-packages\ooxml\docxfile.py", line 46, in parse
    self._doc = parse_from_file(self)
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 655, in parse_from_file
    document = parse_document(doc_content)
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 463, in parse_document
    document.elements.append(parse_table(document, elem))
  File "C:\Python27\lib\site-packages\ooxml\parse.py", line 436, in parse_table
    for p in tc.xpath('./w:p', namespaces=NAMESPACES):
  File "src\lxml\etree.pyx", line 1583, in lxml.etree._Element.xpath
MemoryError
no mem for new parser
MemoryError

Can I somehow increase the memory available to Python? Or can the function be fixed without changing the format of the html output?
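To help narrow this down, here is a minimal diagnostic sketch (not part of the program above, and not a fix). It reports the pointer width of the running interpreter and the uncompressed size of the main document part. The helper names `interpreter_bits` and `document_xml_size` are made up for illustration, and the sketch assumes the usual docx layout in which the body is stored in the zip member `word/document.xml`. It uses only the standard library and should run on Python 2.7 as well as Python 3.

```python
import struct
import zipfile


def interpreter_bits():
    """Return the pointer width of the running interpreter (32 or 64)."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


def document_xml_size(docx_path):
    """Return the uncompressed size, in bytes, of word/document.xml.

    Assumption: the file follows the usual docx layout, where the main
    body is the zip member 'word/document.xml'. A KeyError is raised if
    that member is missing.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.getinfo("word/document.xml").file_size
```

For example, `document_xml_size(r'large_file.docx')` gives the size of the XML that has to be parsed. A 32-bit process on Windows typically gets about 2 GB of user address space by default, and an in-memory XML tree can take several times the size of the raw XML, so comparing the two numbers gives a rough idea of whether the file can fit at all.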

0 answers:

No answers