Question

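I have a Word docx file and I want to print the words that are in bold. Looking at the document as XML, the words I want to print seem to have the following attributes.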

<w:r w:rsidRPr="00510F21">
  <w:rPr><w:b/>
     <w:noProof/>
     <w:sz w:val="22"/>
     <w:szCs w:val="22"/>
  </w:rPr>
  <w:t>Print this Sentence</w:t>
</w:r>

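In particular, the w:rsidRPr="00510F21" attribute seems to specify that the text is bold. Below is more of the XML document, to give a better idea of the structure.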

<w:p w14:paraId="64E19BC3" w14:textId="4D8C930F" w:rsidR="00FF6AD1" w:rsidRDefault="00FF6AD1" w:rsidP="00C11B48">
<w:pPr>
   <w:ind w:left="360" w:hanging="360"/>
   <w:jc w:val="both"/>
   <w:rPr>
       <w:sz w:val="22"/>
       <w:szCs w:val="22"/>
   </w:rPr>
 </w:pPr>
 <w:r>
    <w:rPr><w:b/>
       <w:noProof/><w:sz w:val="22"/>
       <w:szCs w:val="22"/>
    </w:rPr><w:t xml:space="preserve">Some text</w:t>
 </w:r>
 <w:r w:rsidRPr="0009466D">
     <w:rPr><w:i/><w:noProof/>
          <w:sz w:val="22"/><w:szCs w:val="22"/>
     </w:rPr>
     <w:t>For example</w:t>
 </w:r>
 <w:r>
     <w:rPr>
        <w:noProof/>
        <w:sz w:val="22"/>
        <w:szCs w:val="22"/>
     </w:rPr><w:t xml:space="preserve">
     </w:t>
 </w:r>
 <w:r w:rsidRPr="00510F21">
     <w:rPr>
         <w:b/>
         <w:noProof/>
         <w:sz w:val="22"/>
         <w:szCs w:val="22"/>
     </w:rPr>
     <w:t>Print this stuff</w:t>
 </w:r>

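After doing some research and trying the python-docx library, I decided to try lxml. I got an error about namespaces and tried adding the namespace, but it returns an empty set. Below are some of the namespaces declared in the document.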

<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:mv="urn:schemas-microsoft-com:mac:vml" 
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"            xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">

Below is the code I am working with. I want to print the text if the run has the attribute w:rsidRPr="00510F21".

from lxml import etree
root = etree.parse("document.xml")

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

wr_roots = root.findall('w:r', namespaces)
print wr_roots # prints empty set

for atype in wr_roots:
   if w:rsidRPr == '00510F21':
       print(atype.get('w:t'))

Answer 1

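Consider lxml's xpath() method. Recall that .get() retrieves attributes and .find() retrieves nodes. Because the attribute in this XML is in a namespace, you need to prefix its name with the namespace URI in the .get() call. Finally, use the .nsmap object to retrieve all namespaces declared on the document root.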

from lxml import etree
doc = etree.parse("document.xml")
root = doc.getroot()

for wr_roots in doc.xpath('//w:r', namespaces=root.nsmap):
    if wr_roots.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr')\
       == '00510F21':
        print(wr_roots.find('w:t', namespaces=root.nsmap).text)

# Print this stuff
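
As a side note on why the original findall('w:r', namespaces) came back empty: a bare 'w:r' path only matches direct children of the element it is called on, while './/w:r' searches all descendants. The sketch below illustrates this together with the '{uri}name' attribute lookup. It is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (its findall() takes the same namespaces mapping), and SAMPLE is a trimmed, made-up stand-in for document.xml.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, trimmed stand-in for word/document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Some text</w:t></w:r>
      <w:r w:rsidRPr="0009466D"><w:rPr><w:i/></w:rPr><w:t>For example</w:t></w:r>
      <w:r w:rsidRPr="00510F21"><w:rPr><w:b/></w:rPr><w:t>Print this stuff</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

root = ET.fromstring(SAMPLE)

# 'w:r' matches only direct children of <w:document>, so nothing is found.
direct = root.findall("w:r", NS)

# './/w:r' searches every descendant of the root.
runs = root.findall(".//w:r", NS)

# Namespaced attributes are looked up as '{uri}localname' (Clark notation).
matches = [
    run.findtext("w:t", namespaces=NS)
    for run in runs
    if run.get(f"{{{W}}}rsidRPr") == "00510F21"
]

print(len(direct), len(runs), matches)
```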

Answer 2

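If you want to find all bold text, you can use findall() with an XPath expression:

from lxml import etree
namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)

Rather than looking for w:r nodes with the attribute w:rsidRPr="00510F21" (which I am not convinced denotes bold text), look for run nodes (w:r) that have w:b inside the run properties tag (w:rPr), and then access the text tag (w:t) within the run. The w:b tag is the bold property.

The XPath expression can be simplified to './/w:b/../../w:t', although this is less rigorous and could produce false matches.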
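
A slightly more defensive variant is sketched below. It reads word/document.xml straight out of the .docx archive (a .docx file is a zip container) and treats a w:b element whose w:val is "0", "false" or "off" as not bold. The helper names (is_bold, bold_texts, bold_texts_from_docx) are hypothetical, the sketch uses the standard library's xml.etree.ElementTree rather than lxml, and it only sees bold applied directly on the run, not bold inherited from a paragraph or character style.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
OFF_VALUES = {"0", "false", "off"}  # w:val values that switch a toggle off


def is_bold(run):
    """Return True if the run carries a direct <w:b> that is not switched off."""
    b = run.find("w:rPr/w:b", NS)
    if b is None:
        return False
    # A bare <w:b/> means bold; an explicit w:val may turn it off.
    return b.get(f"{{{W}}}val", "true").lower() not in OFF_VALUES


def bold_texts(document_xml):
    """Collect the text of every directly bold run in a document.xml payload."""
    root = ET.fromstring(document_xml)
    texts = []
    for run in root.iter(f"{{{W}}}r"):
        if is_bold(run):
            # A run may hold several <w:t> children; join them in order.
            texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


def bold_texts_from_docx(path):
    """Open the .docx zip container and scan its main document part."""
    with zipfile.ZipFile(path) as archive:
        return bold_texts(archive.read("word/document.xml"))
```

Calling bold_texts_from_docx("file.docx") (the file name is a placeholder) would return a list of strings, one per bold run.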

Trying to parse a docx document in XML format with Python to print words that are in bold
