Question

I have a docx file and need to extract all of its text. The docx also contains tables, which I want to ignore/remove.

My current code is:

import docx2txt
from docx.api import Document
import docx

#initialize the new columns
ctext = list(textdb['txt'])
ctable = list(textdb['tables'])

#call in the file
x = <docx_filepath>
document = Document(x)
tables = document.tables

#see the actual text of tables
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

#tells the count of how many tables are in the docx
tablelength = str(len(tables))
ctable.append(tablelength.replace("'",""))

#process the actual text (this includes the table text right now)
text2 = docx2txt.process(x)
ctext.append(text2.replace("'",""))        

#write values back to the list
textdb['txt'] = ctext
textdb['tables'] = ctable

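I want the text of the file without any of the table text. Right now each table shows up in Python as a separate element (e.g. docx.table.Table at 0x1d303c4f2b0).

Any help would be great, thanks.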

Answer 1

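# Copy the iter_block_items function from https://github.com/python-openxml/python-docx/issues/276
from os import scandir

from docx import Document
from docx.text.paragraph import Paragraph

totaltext = []
for filename in scandir(directory):  # directory: folder that holds the .docx files
    sentences = []
    x = filename.path
    cable = Document(x)
    # Keep paragraph text; mark where each table was with a '<table>' placeholder
    for item in iter_block_items(cable):
        sentences.append(item.text if isinstance(item, Paragraph) else '<table>')
    # Append inside the loop so every file is kept, not just the last one
    totaltext.append(sentences)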

Then you can simply run a script that removes every instance of "<table>" using re.sub.
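As a rough sketch of that cleanup step, the snippet below uses only the standard library. The sample data and the helper names `drop_table_markers` and `join_without_markers` are made up for illustration; the only thing taken from the answer is the shape of `totaltext` (one inner list per document, with '<table>' wherever a table was). It shows two options: filtering the list directly, or joining each document into one string and stripping the marker with re.sub.

```python
import re

# Hypothetical sample shaped like the `totaltext` list built above:
# one inner list per document, '<table>' marks where a table was.
totaltext = [
    ["Intro paragraph", "<table>", "Closing paragraph"],
    ["<table>", "Only paragraph"],
]


def drop_table_markers(sentences):
    """Return one document's paragraphs with the '<table>' markers removed."""
    return [s for s in sentences if s != "<table>"]


def join_without_markers(sentences):
    """Join one document into a single string, then strip markers with re.sub."""
    text = "\n".join(sentences)
    # Remove each marker together with the newline that follows it, if any.
    return re.sub(r"<table>\n?", "", text).strip()


cleaned = [drop_table_markers(doc) for doc in totaltext]
joined = [join_without_markers(doc) for doc in totaltext]

print(cleaned)
print(joined)
```

If the paragraphs are going to stay in a list anyway, the plain filter is probably simpler than re.sub. Note that a paragraph whose real text is exactly "<table>" would also be dropped, so a less common marker string may be safer.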

Python - Remove tables from a docx file
