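I am working on an algorithm that extracts sections of a docx file while preserving the document structure. I managed to get the headings, but how do I get the content between the headings and keep the heading hierarchy? This is what I have so far.
Sample code: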
from docx import Document

document = Document('headerEX.docx')
paragraphs = document.paragraphs

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

for heading in iter_headings(document.paragraphs):
    print(heading.text)
Answer 0 (score: 0)
Something like this should give you a start:
sections = []
section_heading = None
section_paragraphs = []

for paragraph in document.paragraphs:
    if paragraph.style.name.startswith('Heading'):
        # A new heading closes the section collected so far.
        section = {
            'heading': section_heading,
            'paragraphs': section_paragraphs
        }
        sections.append(section)
        section_heading = paragraph.text
        section_paragraphs = []
        continue
    # Any non-heading paragraph belongs to the current section.
    section_paragraphs.append(paragraph)

for section in sections:
    print(section['heading'])
    for paragraph in section['paragraphs']:
        print(paragraph.text)
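As written, this will probably give you an empty first section and will not capture the last section. I'll leave those details to you as an exercise to sharpen your coding skills :)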
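For the hierarchy part of the question, one possible approach is to keep a stack of open sections and nest each heading under the nearest heading with a smaller level. The following is only a minimal sketch, not a definitive implementation. It assumes the document uses the default English style names 'Heading 1', 'Heading 2', and so on; the names heading_level, build_section_tree and print_tree are made up for illustration. Because content is attached to nodes as it is read, the last section is captured, and text before the first heading ends up on the root node.

```python
import re


def heading_level(paragraph):
    """Return N for a paragraph styled 'Heading N', otherwise None.

    Assumption: default English style names; other templates may differ.
    """
    style = getattr(paragraph, "style", None)
    name = getattr(style, "name", None) or ""
    match = re.fullmatch(r"Heading (\d+)", name)
    return int(match.group(1)) if match else None


def build_section_tree(paragraphs):
    """Group paragraph-like objects (with .style.name and .text) into a tree."""
    # The root (level 0) holds any text that appears before the first heading.
    root = {"heading": None, "level": 0, "paragraphs": [], "children": []}
    stack = [root]
    for paragraph in paragraphs:
        level = heading_level(paragraph)
        if level is None:
            # Body text belongs to the most recently opened section.
            stack[-1]["paragraphs"].append(paragraph.text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"heading": paragraph.text, "level": level,
                "paragraphs": [], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def print_tree(node, indent=0):
    """Print headings and body text, indented by nesting depth."""
    pad = "  " * indent
    if node["heading"] is not None:
        print(pad + node["heading"])
    for text in node["paragraphs"]:
        print(pad + "  " + text)
    for child in node["children"]:
        print_tree(child, indent + 1)
```

With python-docx this could be called as tree = build_section_tree(Document('headerEX.docx').paragraphs), followed by print_tree(tree). Only document.paragraphs is visited, so text inside tables is not included.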