Question

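I know there are similar questions out there, but I couldn't find one that answers what I'm after. What I need is a way to access certain data in an MS Word file and save it to an XML file. Reading up on python-docx did not help, as it seems to only allow writing to Word documents, not reading them. To state my task precisely (or at least how I chose to approach it): I want to search the document for a keyword or phrase (the document contains tables) and extract the text data from the table in which the keyword or phrase is found. Does anyone have any ideas?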

Answer 1

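A docx is a zip file containing the document's XML. You can open the zip, read the document, and parse the data with ElementTree.

The advantage of this technique is that you don't need to install any additional Python libraries.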

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

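# Print the text of every cell in every table
for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # print() is a function in Python 3; treat empty <w:t> nodes as ''
            print(''.join(node.text or '' for node in cell.iter(TEXT)))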

For details and references, see my Stack Overflow answer to "How to read contents of an Table in MS-Word file Using Python?".
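
To connect this to the task in the question (find a keyword in the tables, then save the matching data as XML), here is a minimal sketch built on the same zipfile and ElementTree approach. The function names and the output element names (matches, row, cell) are made up for illustration and are not part of any library. It matches on plain substring containment and does not treat nested tables specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, as in the answer above
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def table_rows(docx_path):
    """Yield (table_index, [cell_text, ...]) for every table row in the file."""
    with zipfile.ZipFile(docx_path) as docx:
        root = ET.fromstring(docx.read('word/document.xml'))
    for index, table in enumerate(root.iter(W + 'tbl')):
        for row in table.iter(W + 'tr'):
            # Join all text runs of a cell; empty <w:t> nodes count as ''
            cells = [''.join(t.text or '' for t in cell.iter(W + 't'))
                     for cell in row.iter(W + 'tc')]
            yield index, cells


def matches_to_xml(docx_path, keyword):
    """Build a <matches> element holding every row that mentions keyword."""
    result = ET.Element('matches', keyword=keyword)
    for index, cells in table_rows(docx_path):
        if any(keyword in text for text in cells):
            row_el = ET.SubElement(result, 'row', table=str(index))
            for text in cells:
                ET.SubElement(row_el, 'cell').text = text
    return result
```

The returned element can be written to disk with ET.ElementTree(element).write('matches.xml', encoding='utf-8', xml_declaration=True), where matches.xml is just a placeholder file name.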

Answer 2

To search the document with python-docx:

# Import the module (legacy python-docx API from the mikemaccana repository)
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# search() returns True if the string is found
search(document,'your search string')

There is also a function for getting the document's text:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
# getdocumenttext() returns the text as a list of paragraphs
fullText = getdocumenttext(document)

This uses https://github.com/mikemaccana/python-docx

Answer 3

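It seems pywin32 does the trick. You can iterate over all the tables in the document and over all the cells in each table. Getting the data is a bit tricky (the last 2 characters of every entry have to be omitted), but other than that it's ten minutes of code. If anyone needs further details, please say so in the comments.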
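
Since the answer gives no code, the following is only a rough sketch of what such a loop might look like, not the answerer's actual script. It assumes Windows with Microsoft Word and pywin32 installed, and it assumes the two trailing characters are Word's end-of-cell marker ("\r\x07"). The function names clean_cell_text and read_tables are hypothetical.

```python
def clean_cell_text(raw):
    """Drop the two-character end-of-cell marker from a cell's text."""
    # Assumption: the marker is "\r\x07"; strip it only when it is present.
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_tables(path):
    """Return one list of cleaned cell strings per table in the document.

    Windows only: drives a local Word installation through COM.
    """
    import win32com.client  # lazy import so clean_cell_text works anywhere

    word = win32com.client.Dispatch('Word.Application')
    try:
        word.Visible = False
        doc = word.Documents.Open(path)  # pass an absolute path
        try:
            return [[clean_cell_text(cell.Range.Text)
                     for cell in table.Range.Cells]
                    for table in doc.Tables]
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
```

Unlike the other approaches, this one also works on old .doc files, because Word itself opens the document.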

Answer 4

A simpler library, which can also extract images.

pip install docx2txt

Then use the code below to read the docx file.

import docx2txt
text = docx2txt.process("file.docx")
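
Because docx2txt.process returns plain text, a keyword search on its output works on lines rather than on table cells. A minimal sketch of that, assuming a case-insensitive substring match is good enough; the helper names lines_with_keyword and search_docx are made up for illustration:

```python
def lines_with_keyword(text, keyword):
    """Return the non-empty lines of text that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line.strip() for line in text.splitlines()
            if line.strip() and needle in line.lower()]


def search_docx(path, keyword, image_dir=None):
    """Extract text with docx2txt and return the lines mentioning keyword."""
    import docx2txt  # third-party; installed with `pip install docx2txt`

    # When image_dir is given, docx2txt also writes embedded images there
    # (assumption: the directory already exists).
    if image_dir:
        text = docx2txt.process(path, image_dir)
    else:
        text = docx2txt.process(path)
    return lines_with_keyword(text, keyword)
```

For example, search_docx("file.docx", "total") would return every extracted line that mentions "total" in any letter case.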

How to extract data from doc / docx files using Python
