I have a docx file and need to extract all of the text from it. The docx also contains tables, which I want to ignore/remove.
My current code is:
import docx2txt
from docx import Document

# initialize the new columns (textdb is defined earlier in my script)
ctext = list(textdb['txt'])
ctable = list(textdb['tables'])

# call in the file
x = "<docx_filepath>"
document = Document(x)
tables = document.tables

# see the actual text of the tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

# count how many tables are in the docx
tablelength = str(len(tables))
ctable.append(tablelength.replace("'", ""))

# process the actual text (this includes the table text right now)
text2 = docx2txt.process(x)
ctext.append(text2.replace("'", ""))

# write the values back
textdb['txt'] = ctext
textdb['tables'] = ctable
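I would like the output to contain all of the text with the table text left out. Right now each table shows up in Python as a separate element (e.g. docx.table.Table at 0x1d303c4f2b0).
Any help would be great. Thanks!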
Answer 0 (score: 0)
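# Copy the iter_block_items function from https://github.com/python-openxml/python-docx/issues/276
from os import scandir
from docx import Document
from docx.text.paragraph import Paragraph

totaltext = []
for filename in scandir(directory):  # directory: folder that holds the .docx files
    sentences = []
    x = filename.path
    cable = Document(x)
    for item in iter_block_items(cable):
        # keep paragraph text; put a '<table>' marker wherever a table occurs
        sentences.append(item.text if isinstance(item, Paragraph) else '<table>')
    totaltext.append(sentences)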
Then you can simply run a script that removes every instance of "<table>" using re.sub.
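That clean-up step might look roughly like the sketch below. It assumes totaltext has the shape built above (one list of block strings per document, with '<table>' standing in for each table). The names strip_table_markers and TABLE_MARKER are made up for this illustration and are not part of python-docx.

```python
import re

# Placeholder written in place of each table (assumed to match the loop above).
TABLE_MARKER = "<table>"


def strip_table_markers(documents):
    """Join each document's blocks into one string and drop the table markers.

    `documents` is a list of lists of strings, one inner list per docx file.
    """
    # Match the marker plus the newline that follows it, if there is one.
    pattern = re.compile(re.escape(TABLE_MARKER) + r"\n?")
    cleaned = []
    for sentences in documents:
        text = "\n".join(sentences)
        text = pattern.sub("", text)
        # A marker at the very end leaves a dangling newline; trim it.
        cleaned.append(text.rstrip("\n"))
    return cleaned


if __name__ == "__main__":
    sample = [["Intro", "<table>", "Body"]]
    print(strip_table_markers(sample))
```

One caveat: a paragraph whose own text happens to contain the literal string "<table>" would be altered as well, so a less common marker string may be a safer choice for real documents.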