我无法弄清楚如何从Characters
文件的XML中获取标记DOCX
。 DOCX
文件包含多个文件,包括app.xml
。我想从此<Characters>
获取标记或属性XML
。
from lxml import etree
def docx_get_characters_number(path):
document = zipfile.ZipFile(path)
xml_content = document.read('docProps/app.xml')
document.close()
root = etree.fromstring(xml_content,etree.XMLParser())
return root.xpath('.//Characters')
此函数返回[]
,但我无法弄清楚原因。
为了测试解析器是否有效,我打印了root.xpath('.//*')
,它返回了这个:
[<Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Template at 0x3a8d260>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}TotalTime at 0x3a8d288>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Pages at 0x3a8d2b0>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Words at 0x3a8d2d8>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Characters at 0x3a8d300>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Application at 0x3a8d328>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}DocSecurity at 0x3a8d350>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Lines at 0x3a8d378>, ... etc.
你知道问题出在哪里吗?
我找到了一种方法,但它不优雅,我想我应该继续寻找另一种方式,但是:
def docx_get_characters_number(path):
document = zipfile.ZipFile(path)
xml_content = document.read('docProps/app.xml')
document.close()
these_regex="<Characters>(.+?)</Characters>"
pattern=re.compile(these_regex)
return re.findall(pattern,xml_content)[0]
答案 0 :(得分:1)
这是与默认命名空间相关的常见问题。您的XML在根元素处声明了默认名称空间:
xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"
这意味着,所有没有前缀的元素,包括目标元素<Characters>
,都被认为是在该命名空间中。在命名空间中引用元素的正确方法是将前缀映射到命名空间URI,并在XPath中相应地使用该前缀:
.....
ns = {'d': 'http://schemas.openxmlformats.org/officeDocument/2006/extended-properties'}
return root.xpath('d:Characters', namespaces=ns)
答案 1 :(得分:0)
在这种情况下的问题是在文件的xml声明中。独立属性可能只包含值yes或no。因此,在解析阶段,我的机器上的lxml会失效...
当我使用以下文件时:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties>
<Characters>13088</Characters>
</Properties>
以下脚本确实找到包含元素的字符数:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
"""Explain here what this module has to offer."""
from __future__ import print_function
from lxml import etree as et
def docx_get_characters_number(path_unzipped):
"""Changed to focus on parsing the XML file."""
with open(path_unzipped, 'rt') as f:
xml_string = f.read()
return et.fromstring(xml_string, et.XMLParser()).xpath('.//Characters')
if __name__ == '__main__':
path_unzipped = 'docProps/app.xml'
print(docx_get_characters_number(path_unzipped))
所以它给出了:
[<Element Characters at 0x108bcb518>]
相比之下,在xml声明中替换从“是”到“真”的独立属性的值给出:
Traceback (most recent call last):
File "/some_place/so_parse_characters_xpath.py", line 17, in <module>
print(docx_get_characters_number(path_unzipped))
File "/some_place/so_parse_characters_xpath.py", line 13, in docx_get_characters_number
return et.fromstring(xml_string, et.XMLParser()).xpath('.//Characters')
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115144)
File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:109543)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
lxml.etree.XMLSyntaxError:standalone仅接受'yes'或'no',第1行,第50列
答案 2 :(得分:-1)
你可以像这样使用etree模块:
#to load the xml
from lxml import etree
doc=etree.parse("the path to your xml file")
#to find the Characters tab
doc.find("Characters").text
当然,如果Characters
不是根节点,那么您需要填写完整路径。像Property/Character