Question

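I want to read the header and footer text of a docx file in Python. I am using the python-docx module.

I found this documentation - http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html

But I don't think it has been implemented yet. I also saw that there is a "feature-headers" branch of python-docx on GitHub - https://github.com/danmilon/python-docx/tree/feature-headers

It seems this feature never made it into the master branch. Has anyone used it? Can you help me figure out how to use it?

Thanks a lot.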

Answer 1

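There is a better solution to this problem:

Method used for extraction

Using the MS Word XML document

A docx file is a zip archive, so you can open it with the zipfile module to get at the XML parts of the Word document, and then extract the text by walking the XML nodes.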
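
To see which header and footer parts a given file actually contains, something like the following may help. It is only a sketch: the helper name list_header_footer_parts is made up for illustration, and it assumes the usual naming convention (word/header1.xml, word/footer2.xml, and so on), which Word follows but which other producers are not required to follow.

```python
import re
import zipfile

# Assumption: header/footer parts follow the conventional names
# word/header<N>.xml and word/footer<N>.xml.
PART_PATTERN = re.compile(r"^word/(header|footer)\d*\.xml$")


def list_header_footer_parts(docx):
    """Return the sorted names of header/footer parts in a .docx archive.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return sorted(name for name in archive.namelist()
                      if PART_PATTERN.match(name))
```

Calling list_header_footer_parts("some_file.docx") on your own document shows which names to read, rather than guessing between header1.xml and header2.xml.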

Below is working code that extracts the header, footer, and body text from a docx file.

import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text of its
    header, body, and footer as a single string.
    """
    # Part names differ between documents (header1.xml, header2.xml, ...);
    # adjust this list to match the parts present in your file.
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []

    with zipfile.ZipFile(path) as document:
        available = set(document.namelist())
        for xmlfile in contentToRead:
            part = 'word/{}'.format(xmlfile)
            if part not in available:
                # Skip parts this document does not have instead of raising KeyError.
                continue
            tree = XML(document.read(part))
            # Element.getiterator() was removed in Python 3.9; iter() replaces it.
            for paragraph in tree.iter(PARA):
                texts = [node.text
                         for node in paragraph.iter(TEXT)
                         if node.text]
                if texts:
                    textData = ''.join(texts)
                    if xmlfile == "footer2.xml":
                        extractedTxt = "Footer : " + textData
                    elif xmlfile == "header2.xml":
                        extractedTxt = "Header : " + textData
                    else:
                        extractedTxt = textData

                    paragraphs.append(extractedTxt)
    return '\n\n'.join(paragraphs)


print(get_docx_text("E:\\path_to.docx"))
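
If you need to know which part is the default, first-page, or even-page header rather than relying on file names, the section properties in document.xml reference each header/footer by relationship id, and word/_rels/document.xml.rels maps those ids to part names. A rough sketch of that lookup follows; the function name header_footer_targets is hypothetical, and it assumes the relationship targets are relative to the word/ folder, which is typical but not guaranteed.

```python
import zipfile
from xml.etree.ElementTree import XML

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def header_footer_targets(docx):
    """Map (kind, type) to a part name, e.g. ('header', 'default') -> 'word/header1.xml'.

    If several sections reference the same kind/type, the last one wins.
    """
    with zipfile.ZipFile(docx) as archive:
        rels = XML(archive.read("word/_rels/document.xml.rels"))
        body = XML(archive.read("word/document.xml"))

    # Relationship id -> target part (assumed relative to word/).
    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.iter(REL + "Relationship")}

    result = {}
    for kind in ("header", "footer"):
        # w:headerReference / w:footerReference elements live inside w:sectPr.
        for ref in body.iter(W + kind + "Reference"):
            target = targets.get(ref.get(R + "id"))
            if target:
                result[(kind, ref.get(W + "type"))] = "word/" + target
    return result
```

The returned part names can then be read and parsed the same way as in get_docx_text above.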
