Question

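I want to extract text from a Word document that was edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the code below, I see that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text: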

import docx

doc = docx.Document('C:\\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the inserted text of a revision (the w:ins element)?

I'm using python-docx 0.8.6, lxml 3.4.0, Python 3.4, Win7.

Thanks

Answer 1

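Not directly using python-docx; there is no API support yet for tracked changes/revisions.

This is a fairly tricky job, as you will discover if you search on the element names, perhaps 'open xml w:ins' for a start, which brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

If I needed to do something like this in a pinch, I would get the body element using:

body = document._body._body

and then use XPath on it to return the elements I wanted, something vaguely like this aircode:

from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)

You will be on your own for figuring out which XPath expression gives you the paragraphs you want.

opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html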
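
To make the element structure concrete, here is a minimal sketch using only the standard library. The sample paragraph XML is hand-written for illustration (it is not taken from a real document), and the helper names `inserted_text` and `accepted_text` are hypothetical. It assumes the common case where Word wraps inserted runs in `w:ins` inside the paragraph and stores deleted text in `w:delText` rather than `w:t`. A similar path expression could probably be passed to `body.xpath(...)` in python-docx, but that is untested here.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical paragraph: a normal run, a tracked insertion,
# a tracked deletion, then another normal run.
SAMPLE_P = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="A"><w:r><w:t>new </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="A"><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p>' % W_NS
)


def inserted_text(p_xml):
    """Return only the text that sits inside w:ins elements."""
    root = ET.fromstring(p_xml)
    return "".join(t.text or "" for t in root.findall(".//w:ins//w:t", NS))


def accepted_text(p_xml):
    """Return all w:t text in document order; w:delText is skipped."""
    root = ET.fromstring(p_xml)
    return "".join(t.text or "" for t in root.iter("{%s}t" % W_NS))


print(repr(inserted_text(SAMPLE_P)))
print(repr(accepted_text(SAMPLE_P)))
```

The deleted text never shows up because it lives in a differently named element, so no explicit filtering of `w:del` is needed in this sketch.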

Answer 2

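I have been running into the same problem for years (perhaps for as long as this question has existed).

By looking at the "etienned" code posted by @yiftah and at the attributes of Paragraph, I found a solution for retrieving the text after accepting the changes.

The trick is to get p._p.xml, which is the XML of the paragraph, and then apply the "etienned" code to it (that is, retrieve all <w:t> elements from the XML, which covers both regular runs and <w:ins> blocks).

Hope it helps lost souls like me.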

from docx import Document

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML


WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"


def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Element.getiterator() was removed in Python 3.9; iter() is the replacement
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text


doc = Document("Hello.docx")

for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")

Answer 3

The following code from Etienne worked for me; it works directly on the document's XML (rather than using python-docx):

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
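
For readers who cannot follow the link, the general idea of reading the XML directly can be sketched as below. This is not Etienne's exact code, only a rough standard-library approximation; the function name `docx_paragraph_texts` is hypothetical. It relies on a .docx file being a zip archive whose main body is stored at `word/document.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(path):
    """Return the non-empty paragraph texts of a .docx file.

    `path` may be a file name or a binary file-like object.
    Text inside w:ins is included; deleted text (w:delText) is not.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for p in root.iter(W + "p"):
        # Collect every w:t below the paragraph, at any depth.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            texts.append(text)
    return texts
```

Usage would look like `docx_paragraph_texts("Hello.docx")`. Paragraphs nested inside other paragraphs (for example in text boxes) would be counted twice by this simple loop.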
