Parsing XML from a .docx with lxml raises IOError (Python)

Date: 2015-12-01 22:54:50

Tags: python xml lxml docx

I read the XML out of a .docx file into a variable named xml_content. The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
   <w:body>
      <w:p>
         <w:pPr>
            <w:pStyle w:val="Normal" />
            <w:ind w:left="5070" w:right="0" w:hanging="0" />
            <w:rPr>
               <w:rFonts w:cs="Book Antiqua" w:ascii="Book Antiqua" w:hAnsi="Book Antiqua" />
            </w:rPr>
         </w:pPr>
         <w:r>
            <w:rPr>
               <w:rFonts w:cs="Book Antiqua" w:ascii="Book Antiqua" w:hAnsi="Book Antiqua" />
            </w:rPr>
            <w:t xml:space="preserve">                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     </w:t>
            <w:pict>
               <v:rect id="shape_0" stroked="f" style="position:absolute;margin-left:405pt;margin-top:0pt;width:80.9pt;height:71.9pt">
                  <v:imagedata r:id="rId2" detectmouseclick="t" />
                  <v:wrap v:type="none" />
                  <v:stroke color="#3465a4" joinstyle="round" endcap="flat" />
               </v:rect>
            </w:pict>
            <w:pict>
               <v:rect id="shape_0" stroked="f" style="position:absolute;margin-left:0.05pt;margin-top:0pt;width:71.9pt;height:70.1pt">
                  <v:imagedata r:id="rId3" detectmouseclick="t" />
                  <v:wrap v:type="none" />
                  <v:stroke color="#3465a4" joinstyle="round" endcap="flat" />
               </v:rect>
            </w:pict>
         </w:r>
      </w:p>
...
   </w:body>
</w:document>

I want to parse this XML with lxml. My code looks like this:

import lxml.etree

document = zipfile.ZipFile('test.docx')
xml_content = document.read('word/document.xml')
tree = lxml.etree.parse(xml_content)

When I run this code, I get the following error:

Traceback (most recent call last):
  File "import.py", line 29, in <module>
    tree = lxml.etree.parse(xml_content)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src/lxml/lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95722)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94754)
IOError

1 answer:

Answer 0 (score: 1)

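ZipFile.read() returns the member's contents as a string (bytes), so your variable xml_content is a string, not a file-like object. lxml.etree.parse() expects a source to read from: a file name or path, a URL, or a file-like object. When you hand it a plain string, it treats the XML text as a file name and fails to open it, which is why you get an IOError. To parse XML that you already hold in memory, use lxml.etree.fromstring() instead: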

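import zipfile
import lxml.etree

# A .docx file is a zip archive; the main body lives in word/document.xml.
document = zipfile.ZipFile('test.docx')
# read() returns the member's contents as a string (bytes), not a file object.
xml_content = document.read('word/document.xml')
# fromstring() parses in-memory data and returns the root element.
tree = lxml.etree.fromstring(xml_content)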
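
If you would rather keep calling parse(), it should also work once it is given a file-like object. The following is a minimal sketch of two ways to do that, not the only approach. It makes some assumptions that are not in the question: it builds a small stand-in archive in memory (SAMPLE_XML is a hypothetical, heavily trimmed document.xml) so that it runs without a real test.docx, and it falls back to the standard library's xml.etree.ElementTree when lxml is not installed, since both offer parse() and fromstring() with the same behaviour for this case.

```python
import io
import zipfile

try:
    import lxml.etree as etree
except ImportError:
    # Assumption for portability: the stdlib module exposes the same
    # parse()/fromstring() calls used below.
    import xml.etree.ElementTree as etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical stand-in for word/document.xml, trimmed to one text run.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
    '<w:t>hello</w:t></w:r></w:p></w:body></w:document>' % W_NS
).encode("utf-8")

# Build a tiny zip archive in memory in place of test.docx.
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("word/document.xml", SAMPLE_XML)
archive_bytes.seek(0)

with zipfile.ZipFile(archive_bytes) as document:
    xml_content = document.read("word/document.xml")

    # Option 1: wrap the bytes in a file-like object before calling parse().
    tree_from_bytes = etree.parse(io.BytesIO(xml_content))

    # Option 2: ask ZipFile for a file-like object and parse it directly.
    with document.open("word/document.xml") as member:
        tree_from_member = etree.parse(member)

# fromstring() returns the root element, while parse() returns an
# ElementTree, so call getroot() to compare like with like.
root_from_string = etree.fromstring(xml_content)
root_from_bytes = tree_from_bytes.getroot()
root_from_member = tree_from_member.getroot()

# Elements are addressed by namespace URI, not by the "w:" prefix.
first_text = root_from_string.findtext(".//{%s}t" % W_NS)
```

Note the difference in return types: fromstring() gives you the root element directly, while parse() gives you an ElementTree wrapper. If later code expects a tree object (for example, to call getroot() or write()), the parse() variants may be the more convenient choice.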