Question

I have a docx that I read into Jupyter like this:

### Import libraries
import docx2txt
import os
import re
import pandas
import docx

### Read document
file_text = docx2txt.process("big_document.docx")

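In this document, multiple pages share the same header. I want to search for these headers and then group all pages with the same header into their own object. In the block below, the first thirty pages of my document all carry the header EXAMPLE ONE (not header formatting, just the one identifying string on each page that matches the other 29 pages):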

### Loop to get appropriate sections, according to the re.findall()
for i in range(0, 30):
    match = re.findall(r'EXAMPLE\sONE', file_text)
    print(match[i])

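re.findall() finds every instance of EXAMPLE ONE, but it only returns those two words, 30 times. If I use re.split() instead and set the range accordingly, it returns the whole document (a few hundred pages).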

### Loop to get appropriate sections, according to the re.split()
for i in range(0, 30):
    match = re.split(r'EXAMPLE\sONE', file_text)
    print(match[i])

# still returns whole document, instead of just the 30 pages with the chosen header

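How can I set up the code so that it returns the pages with the appropriate header, and only those pages? I think re.split() is the tool I need, but I can't get it to work.

The document has multiple headers, up to EXAMPLE SEVEN. I will write a for loop for each header and return one object for each. Thanks.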

Answer 1

I think you won't be able to get the matching pages for a given header because, if I'm not mistaken, docx2txt does not return an "end of page" character, which would have let you specify where to stop.

However, what you can do is use a regex like this to get everything before:

docx
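The regex from this answer did not survive on this page, so the following is only a sketch of one way to do it, not the answerer's original pattern. It assumes that all pages for one header are contiguous (as with the first thirty pages in the question) and that every header has the form EXAMPLE followed by an uppercase word. The function name `section_for` is hypothetical.

```python
import re

def section_for(header_word, text):
    """Return the text from the first 'EXAMPLE <header_word>' up to
    (but not including) the next header with a different word."""
    word = re.escape(header_word)
    pattern = re.compile(
        # Start at the wanted header, then match lazily ...
        r'EXAMPLE\s' + word + r'\b.*?'
        # ... until a different EXAMPLE header or the end of the text.
        r'(?=EXAMPLE\s(?!' + word + r'\b)[A-Z]+|\Z)',
        re.DOTALL,  # let '.' cross line breaks
    )
    match = pattern.search(text)
    return match.group() if match else ''

sample = (
    "intro\n"
    "EXAMPLE ONE\npage 1\n"
    "EXAMPLE ONE\npage 2\n"
    "EXAMPLE TWO\npage 31\n"
)
print(section_for("ONE", sample))
```

With the question's data this would be called as `section_for("ONE", file_text)`. If pages with the same header are scattered through the document, this returns only the first run of them.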

Answer 2

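from collections import defaultdict
import re

from docx2python import docx2python
from docx2python.iterators import iter_paragraphs

text = docx2python('path_to_file.docx')
groups = defaultdict(list)
open_group = None  # no header seen yet
for par in iter_paragraphs(text.document):
    # A paragraph that contains a header switches the active group
    header = re.search(r'EXAMPLE\s[A-Z]+', par)
    if header:
        open_group = groups[header.group()]
    # Paragraphs before the first header are skipped
    if open_group is not None:
        open_group.append(par)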
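The same grouping can also be sketched with the standard library alone, working on the plain string that docx2txt returns in the question. This is a minimal sketch, not part of the original answer: it relies on the fact that re.split() keeps the matched separators in its result when the pattern is wrapped in a capturing group. The function name `group_by_header` is hypothetical, and the header pattern is assumed to contain no capturing groups of its own.

```python
import re
from collections import defaultdict

def group_by_header(text, header_pattern=r'EXAMPLE\s[A-Z]+'):
    """Map each header string to the list of text chunks that follow it."""
    # With a capturing group, re.split returns
    # [text_before_first_header, header, body, header, body, ...]
    parts = re.split('(' + header_pattern + ')', text)
    groups = defaultdict(list)
    # Odd indices hold headers, the following even indices hold their bodies
    for header, body in zip(parts[1::2], parts[2::2]):
        groups[header].append(body.strip())
    return dict(groups)

sample = (
    "intro\n"
    "EXAMPLE ONE\npage 1\n"
    "EXAMPLE ONE\npage 2\n"
    "EXAMPLE TWO\npage 31\n"
)
groups = group_by_header(sample)
print(groups["EXAMPLE ONE"])
```

With the question's data, `group_by_header(file_text)["EXAMPLE ONE"]` would hold one chunk per occurrence of that header, without a separate loop for each of the seven headers.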

Splitting a Word document with a regex, then grouping similar headers into their own objects
