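The python-docx library can be used to work with Word documents. The code below extracts all paragraphs and tables in document order and appends them to a list.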
from docx.document import Document as doctwo
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, doctwo):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    # Walk the direct children only, so nested content keeps its position.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


# `document` is assumed to be an already opened docx Document object.
document_as_list = []
for block in iter_block_items(document):
    # Check the type directly instead of matching on str(block).
    if isinstance(block, Paragraph):
        document_as_list.append(block.text)
    elif isinstance(block, Table):
        document_as_list.append(block)
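However, the code above does not extract images from the document; it only handles paragraphs and tables. Every image in the document has a unique "rID". I already have code that extracts all the images from the Word document as a whole.
The requirement, though, is to extract the images in document order. It would be enough to append each image's "rID" to the list "document_as_list", since the images appear in document order inside the paragraphs and tables. I know this means working with the XML of the Word document, but I don't know how to turn that into code. Can someone help me?
I have gone through the following Stack Overflow questions, but I could not find a good solution to this problem.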
Answer 0 (score: 0)
I have shared an answer to this problem at the following GitHub link:
Reading paragraphs, tables and images in document order from .docx
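
For readers who cannot follow the link, here is a minimal sketch of the general idea, not necessarily what the linked code does. It assumes the images are DrawingML pictures, where each picture is an a:blip element whose r:embed attribute holds the rID; legacy VML images are not covered. The helper names image_rids and blocks_with_images are made up for this example, and the sample XML is a hand-written, heavily simplified stand-in for word/document.xml. It uses only the standard library so it can be run without a real .docx file.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and DrawingML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def image_rids(element):
    """Return the r:embed ids of every a:blip under *element*, in XML order."""
    rids = []
    for blip in element.iter("{%s}blip" % A_NS):
        rid = blip.get("{%s}embed" % R_NS)
        if rid is not None:
            rids.append(rid)
    return rids


def blocks_with_images(body):
    """List paragraph, table and image markers for the children of *body*."""
    items = []
    for child in body:
        if child.tag == "{%s}p" % W_NS:
            text = "".join(t.text or "" for t in child.iter("{%s}t" % W_NS))
            items.append(("paragraph", text))
        elif child.tag == "{%s}tbl" % W_NS:
            items.append(("table", None))
        else:
            continue  # skip section properties and anything else
        # Images are recorded right after the block that contains them.
        items.extend(("image", rid) for rid in image_rids(child))
    return items


# Hypothetical, simplified sample; a real document nests a:blip much deeper.
SAMPLE = f"""
<w:document xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">
  <w:body>
    <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
    <w:p><w:r><w:drawing><a:blip r:embed="rId5"/></w:drawing></w:r></w:p>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:drawing><a:blip r:embed="rId6"/></w:drawing></w:r></w:p>
    </w:tc></w:tr></w:tbl>
  </w:body>
</w:document>
"""

body = ET.fromstring(SAMPLE).find("{%s}body" % W_NS)
result = blocks_with_images(body)
print(result)
```

The same image_rids helper should also work on the lxml elements that python-docx wraps, so inside the loop from the question you could probably call image_rids(block._element) and append the returned rIDs to document_as_list. Note that _element is a private attribute, so treat this as an assumption about the library's internals. As far as I know, document.part.related_parts[rid] then gives the image part for an rID, and its blob attribute holds the raw image bytes.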