Question

I have a docx document that is structured into sections and subsections, for example:

Part A

texttexttext



texttexttext

1.1 texttexttext



texttexttext

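(a) texttexttext

I want to extract the text with python-docx. Getting the text of the paragraphs is easy, but I don't know how to get the section numbering (for example "1." and "(a)"). Is there a simple way to do this?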

Answer 1

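How easy this is depends on how rigorous the author was when structuring the document.

In the best case, the author used styles for all section headings, and you can then pick out the paragraphs that have, for example, the "Heading 1" style.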

from docx import Document

document = Document('document.docx')  # placeholder path
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
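
The same check can be extended to every heading level. This is only a minimal sketch, assuming the author used Word's built-in "Heading N" style names (custom styles would need a different pattern). The helper names `heading_level`, `outline`, and `print_outline` are made up for illustration and are not part of python-docx.

```python
import re

# Built-in heading styles are named "Heading 1", "Heading 2", ...
HEADING_STYLE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the level encoded in a 'Heading N' style name, or None."""
    match = HEADING_STYLE.match(style_name or "")
    return int(match.group(1)) if match else None


def outline(paragraphs):
    """Yield (level, text) for each paragraph that uses a 'Heading N' style."""
    for paragraph in paragraphs:
        level = heading_level(paragraph.style.name)
        if level is not None:
            yield level, paragraph.text


def print_outline(path):
    """Print the headings of the .docx file at `path`, indented by level."""
    # Imported here so the helpers above can be used without python-docx.
    from docx import Document

    for level, text in outline(Document(path).paragraphs):
        print("  " * (level - 1) + text)
```

Calling `print_outline('document.docx')` (placeholder path) would print one line per styled heading. Note that it prints whatever is in `paragraph.text`, so a number appears only if it was typed as part of the heading text.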

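If the author instead used character formatting such as bold and font size to mark headings, your job will be harder, because that formatting is unlikely to identify headings uniquely.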
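
For that harder case, a rough heuristic is about the best one can do. The sketch below assumes that headings are short paragraphs in which every run is explicitly bold. The function names and the `max_chars` threshold are hypothetical, and bold body text will produce false positives.

```python
def looks_like_heading(paragraph, max_chars=80):
    """Guess whether a paragraph is a heading from its character formatting.

    Heuristic only: the paragraph must be short and every non-blank run must
    be explicitly bold. run.bold is None when boldness is inherited from a
    style, and such runs are not counted as bold here.
    """
    runs = [run for run in paragraph.runs if run.text.strip()]
    if not runs or len(paragraph.text.strip()) > max_chars:
        return False
    return all(run.bold is True for run in runs)


def candidate_headings(path):
    """Return the text of every paragraph in `path` that looks like a heading."""
    # Imported here so looks_like_heading can be used without python-docx.
    from docx import Document

    return [p.text for p in Document(path).paragraphs if looks_like_heading(p)]
```

The result is a list of candidates to review by hand, not a reliable list of headings.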

Answer 2

I suggest using sections, as in the following example:

    >>> document = Document()
    >>> sections = document.sections
    >>> sections
    <docx.parts.document.Sections object at 0x1deadbeef>
    >>> len(sections)
    3
    >>> section = sections[0]
    >>> section
    <docx.section.Section object at 0x1deadbeef>
    >>> for section in sections:
    ...     print(section.start_type)
    ...
    NEW_PAGE (2)
    EVEN_PAGE (3)
    ODD_PAGE (4)
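
Keep in mind that a python-docx section describes page layout (start type, page size, margins), so `document.sections` will not yield labels such as "1.1" or "(a)". If the numbers were typed as literal text at the start of each paragraph, they can be split off with a regular expression. The sketch below makes that assumption; `LABEL` and `split_label` are hypothetical names, and the pattern will also match ordinary text that happens to start with a number. Numbers that Word generates through automatic list numbering are not part of `paragraph.text`, so this approach cannot recover them.

```python
import re

# A leading label such as "1", "1.", "1.1", "(a)" or "(iv)", then the rest.
LABEL = re.compile(
    r"""^\s*(
        \d+(?:\.\d+)*\.?        # 1   1.   1.1   1.1.2
        | [(（]\w{1,4}[)）]       # (a) (iv), ASCII or full-width parentheses
    )\s+(.*)$""",
    re.VERBOSE,
)


def split_label(text):
    """Return (label, rest); label is None when no leading label is found."""
    match = LABEL.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text
```

For example, `split_label("1.1 texttexttext")` separates the label "1.1" from the remaining text, and it can be applied to each `paragraph.text` in `document.paragraphs`.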

How can I extract section numbers from a docx document with python-docx?
