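I am developing a program that needs to extract two images from an MS Word document so I can use them in another document. I know where the images are (the first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.
Here is the Word document I want to extract the images from. I want to extract the "Rentel" images from the first page (first table, rows 0 and 1, column 2).
I tried the following code: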
from docxtpl import DocxTemplate
source_document = DocxTemplate("Source document.docx")
# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)
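This gives me empty lines...
I read in this discussion and this one that the problem could be that the table is "contained in a wrapper element that python-docx cannot read". They suggest altering the source document, but I want to be able to select any document that was previously created from the same template as the source document. Those documents have the same problem, and I cannot change each one by hand, so a Python-only solution is really the only option I can consider.
Since I only want those two specific images, unzipping the Word file and pulling arbitrary images out of the package does not really work for me, unless I know which image name I need to extract from the unzipped Word file folder.
I really want this to work because it is part of my thesis (and I am just a mechatronics engineer, so I don't know that much about software).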
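[EDIT]: Here is the XML of the first image (source_document.tables[0].cell(0,2)._tc.xml) and here is the XML of the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that using (0,2) as the row and column gives me all the rows in column 2 within the first "visible" table, and that cell (1,2) gives me all the rows in column 2 within the second "visible" table.
If the problem cannot be solved directly with python-docx, would it be possible to search the XML for an image name, an ID, or something similar, and then use that ID or name to add the image with python-docx?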
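Answer 0 (score: 1)
Well, the first thing that jumps out is that both of the cells (w:tc elements) you posted contain a nested table. This is perhaps unusual, but it is certainly a valid composition. Maybe it was done this way so a caption could go in a cell below the image, or something like that.
To access the nested table, you would have to do something like this: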
outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---
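I'm not sure this solves your whole problem, but it strikes me that this is really two or more questions. The first is "Why isn't my table cell showing up?" and the second is probably "How do I get an image out of a table cell?" (once you have actually found the cell in question).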
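For the second question, a rough sketch of one possible approach is below. It is not python-docx API: it uses only the standard library to scan a cell's XML (the string you get from `cell._tc.xml`) for `a:blip` elements and collect their `r:embed` relationship ids. The function name `find_image_rids` and the sample XML are made up for illustration, and the sample is far simpler than what Word actually writes.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML / DrawingML documents
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}


def find_image_rids(cell_xml):
    """Return the r:embed ids of all a:blip elements in *cell_xml*.

    Hypothetical helper: root.iter() visits every descendant, so images
    inside nested tables are found as well, in document order.
    """
    root = ET.fromstring(cell_xml.strip())
    blip_tag = "{%s}blip" % NS["a"]
    embed_attr = "{%s}embed" % NS["r"]
    return [
        blip.get(embed_attr)
        for blip in root.iter(blip_tag)
        if blip.get(embed_attr) is not None
    ]


# Simplified, made-up cell XML: an outer cell holding a nested table with one image
SAMPLE_CELL_XML = """
<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
      xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:tbl><w:tr><w:tc><w:p><w:r><w:drawing>
    <a:graphic><a:graphicData><a:blip r:embed="rId5"/></a:graphicData></a:graphic>
  </w:drawing></w:r></w:p></w:tc></w:tr></w:tbl>
</w:tc>
"""

print(find_image_rids(SAMPLE_CELL_XML))
```

Matching on the namespace URI rather than on the literal `a:` prefix means the lookup should still work if a document happens to declare a different prefix for the same namespace.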
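Answer 1 (score: 0)
For anyone who runs into the same problem, this is the code that helped me solve it:
First, I use the following method to extract the nested cell from the table: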
@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
    Returns the nested cell (table inside a table) of the *document*
    :argument
        table: [docx.Table] outer table from which to get the nested table
        outer_row: [int] row of the outer table in which the nested table is
        outer_column: [int] column of the outer table in which the nested table is
        inner_row: [int] row in the nested table from which to get the nested cell
        inner_column: [int] column in the nested table from which to get the nested cell
    :return
        inner_cell: [docx.Cell] nested cell
    """
    # Get the global first cell
    outer_cell = table.cell(outer_row, outer_column)
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)
    return inner_cell
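With this cell, I can get the XML and extract the image from it. Note: in the replace_logos_from_source method, I know that the table I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I filled those values in directly in the call to get_nested_cell instead of adding extra arguments to replace_logos_from_source.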
def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
    Replace the employer and client logo from the *source_document* to the *target_document*. Since the table
    in which the logos are placed are nested tables, the source and target cells with *inner_row* and
    *inner_column* are first extracted from the nested table.
    :argument
        source_document: [DocxTemplate] document from which to extract the image
        target_document: [DocxTemplate] document to which to add the extracted image
        inner_row: [int] row in the nested table from which to get the image
        inner_column: [int] column in the nested table from which to get the image
    :return
        Nothing
    """
    # Get the target and source cell (I know that the table where I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled it in without adding extra arguments to the method)
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)
    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml
    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)
    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream:  # If not None (image exists)
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)
# Module-level imports needed by this method:
#     from io import BytesIO
#     from xml.dom import minidom
@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
    Returns the first image referenced in the *xml_code* as a byte stream
    :argument
        source_document: [DocxTemplate] document that owns the image part
        xml_code: [string] xml code from which to extract the image
    :return
        image_stream: [BytesIO stream] the image to find
        None if no image exists in the xml_code
    """
    # Parse the xml code for the blip
    xml_parser = minidom.parseString(xml_code)
    items = xml_parser.getElementsByTagName('a:blip')
    # If no image exists
    if not items:
        return None
    # Extract the rId of the image
    rId = items[0].attributes['r:embed'].value
    # Get the blob (raw bytes) of the image part that the rId points to
    source_document_part = source_document.part
    image_part = source_document_part.related_parts[rId]
    image_bytes = image_part.blob
    # Wrap the image bytes in a BytesIO stream so they can be fed to run.add_picture()
    image_stream = BytesIO(image_bytes)
    return image_stream
I call the method like this:
# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2) # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2) # Client logo