Question

I am using python-docx to extract the data of one particular table from a Word file. The file contains multiple tables. This is the particular table among the multiple tables, and the retrieved data needs to be arranged like this.

Challenges:

I can use python-docx

Answer 1

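This is not a complete answer, but it should point you in the right direction. It is based on some similar tasks I have been working on.

I ran the following code under Python 3.6 in a Jupyter notebook, but it should work just as well in plain Python.

First, we import the Document class from docx and point it at the document we want to use.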

from docx import Document

# Replace the placeholder with the path to your .docx file.
document = Document("<your path to doc>")

We get the list of tables and print how many there are. We also create a list to hold the data from all of the tables.

tables = document.tables
print(len(tables))

big_data = []

Next, we loop over the tables:

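for table in document.tables:
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)

        # The first row holds the column headings, which become the dictionary keys.
        if i == 0:
            keys = tuple(text)
            continue

        # Every later row becomes one dictionary keyed by column heading.
        row_data = dict(zip(keys, text))
        data.append(row_data)

    # Append each table's rows once, after the whole table has been read.
    big_data.append(data)

print(big_data)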

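By looping over all the tables, we read the data and build a list of lists. Each inner list represents one table and holds one dictionary per data row. Each dictionary contains key/value pairs: the key is a column heading from the table, and the value is the content of the cell in that column for that row.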
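Since the question is about one particular table, one possible way to pick it out is to match on its column headings. The following is only a minimal sketch: the helper names `find_table_by_headers` and `table_to_records` are hypothetical (they are not part of python-docx), and it assumes that the first row of each table is its header row.

```python
def table_to_records(table):
    """Turn one table into a list of dicts keyed by the header row."""
    rows = list(table.rows)
    if not rows:
        return []
    # Assumption: the first row contains the column headings.
    keys = [cell.text for cell in rows[0].cells]
    return [
        dict(zip(keys, (cell.text for cell in row.cells)))
        for row in rows[1:]
    ]


def find_table_by_headers(tables, required_headers):
    """Return the first table whose header row contains every required heading, else None."""
    wanted = {header.strip() for header in required_headers}
    for table in tables:
        rows = list(table.rows)
        if not rows:
            continue
        # Compare on stripped text so headings like 'Date ' still match 'Date'.
        headers = {cell.text.strip() for cell in rows[0].cells}
        if wanted <= headers:
            return table
    return None


# Usage (assumes python-docx is installed; "report.docx" is a placeholder path):
# from docx import Document
# document = Document("report.docx")
# table = find_table_by_headers(document.tables, ["Version", "Changes"])
# records = table_to_records(table) if table is not None else []
```

If you already know the position of the table, indexing `document.tables` directly (for example `document.tables[-1]` for the last table) is simpler than matching on headings.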

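So, that is half of your problem. The next part would be to use python-docx to create a new table in your output document, and to fill it with the corresponding content from the list/list/dictionary data.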
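As a rough sketch of that second half, the function below writes a list of row dictionaries into a new table. The name `write_records_table` is hypothetical, and the sketch assumes that every dictionary uses the keys of the first one as its columns. It relies on `add_table`, `add_row`, and the `text` attribute of cells from python-docx.

```python
def write_records_table(document, records):
    """Append a table to `document` with one header row and one row per record."""
    if not records:
        return None
    # Assumption: the keys of the first record define the columns.
    headers = list(records[0].keys())
    table = document.add_table(rows=1, cols=len(headers))
    # Fill the header row that add_table created.
    for cell, header in zip(table.rows[0].cells, headers):
        cell.text = header
    # Add one row per dictionary, in header order; missing keys become empty cells.
    for record in records:
        cells = table.add_row().cells
        for cell, header in zip(cells, headers):
            cell.text = str(record.get(header, ""))
    return table


# Usage (assumes python-docx is installed; "output.docx" is a placeholder path):
# from docx import Document
# out = Document()
# write_records_table(out, [{"Version": "1", "Date": "22 August 2016"}])
# out.save("output.docx")
```

How the rows should be arranged in the output depends on the layout you want, so treat the column order here as a starting point only.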

In the example I have been working on, this is the final table in the document.

When I run the routine above, this is my output:

[{'Version': '1', 'Changes': 'Local Outcome Improvement Plan ', 'Page Number': '1-34 and 42-61', 'Approved By': 'CPA Board\n', 'Date ': '22 August 2016'}, 
{'Version': '2', 'Changes': 'People are resilient, included and supported when in need section added ', 'Page Number': '35-41', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}, 
{'Version': '2', 'Changes': 'Updated governance and accountability structure following approval of the Final Report for the Review of CPA Infrastructure', 'Page Number': '59', 'Approved By': 'CPA Board', 'Date ': '12 December 2016'}]]
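Note that the output above carries stray whitespace from the Word cells, such as the key 'Date ' and the value 'CPA Board\n'. If that gets in the way, a small cleanup pass along these lines might help. The name `clean_records` is hypothetical, and the sketch assumes all keys and values are strings.

```python
def clean_records(records):
    """Return copies of the row dictionaries with whitespace trimmed from keys and values."""
    return [
        {key.strip(): value.strip() for key, value in record.items()}
        for record in records
    ]


rows = [{"Approved By": "CPA Board\n", "Date ": "22 August 2016"}]
print(clean_records(rows))
# prints [{'Approved By': 'CPA Board', 'Date': '22 August 2016'}]
```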

How to retrieve particular table data from multiple tables using python-docx?
