Question

I want to end a loop as soon as the next entry starts. For example, suppose I have the following txt file made up of three documents:

Document 1
text1
text1
tex1
Document 2
text2
text2
text2    
Document 3
text3
text3
text3

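I am trying to build a JSON file in which all the text of each article is joined together, e.g. 'body' = text1 text1 text1; 'body' = text2 text2 text2; and 'body' = text3 text3 text3. To do this, I search for the word Document and then concatenate the text that follows it. The problem is that my code skips a document, so it only works for Documents 1 and 3: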

for line in f:
    if re.search(r"Document ", line):
        text = ''
        while not re.search(r"Document ", line):
            text += line+' '                     
        article['body'] = text

Any ideas on how to tell the code to stop (the while not) once the next document starts?

Answer 1

You can use the following Python code:

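import re

article = []
start_matching = False  # becomes True once the first "Document N" header is seen
text = ""
with open(path, "r") as file:  # path: location of the txt file
    for line in file:
        if re.match(r"Document\s+\d", line):
            start_matching = True
            if text:
                # a new document starts: store the previous one
                article.append(text.strip())
                text = ""
            text += line
        else:
            if start_matching:
                text += line
# store the last document, which has no header after it
if text:
    article.append(text.strip())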

print(article)
# => ['Document 1\ntext1\ntext1\ntex1', 'Document 2\ntext2\ntext2\ntext2', 'Document 3\ntext3\ntext3\ntext3']

See the online demo.

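The point is to start matching only when a line begins with Document, one or more whitespace characters, and a digit (if re.match(r"Document\s+\d", line):), then accumulate the lines that belong to that document and append the result to the list (you can adjust the output as needed).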
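As a rough sketch of how the same header-detection idea could produce the 'body' strings the question asks for, the variant below drops the header line and joins the remaining lines with spaces. The names collect_bodies, DOC_HEADER and sample_lines are made up for this illustration, and the sample is an in-memory copy of the file from the question rather than a real file.

```python
import json
import re

# Same header test as in the answer: "Document", whitespace, a digit
DOC_HEADER = re.compile(r"Document\s+\d")

def collect_bodies(lines):
    """Join the lines that follow each 'Document N' header into one body string."""
    articles = []
    parts = None  # stays None until the first header is seen
    for line in lines:
        if DOC_HEADER.match(line):
            # a new document starts: store the previous one, if any
            if parts is not None:
                articles.append({"body": " ".join(parts)})
            parts = []
        elif parts is not None and line.strip():
            parts.append(line.strip())
    # store the last document, which has no header after it
    if parts is not None:
        articles.append({"body": " ".join(parts)})
    return articles

# In-memory stand-in for the txt file shown in the question
sample_lines = [
    "Document 1", "text1", "text1", "tex1",
    "Document 2", "text2", "text2", "text2    ",
    "Document 3", "text3", "text3", "text3",
]

articles = collect_bodies(sample_lines)
print(json.dumps(articles, indent=2))
```

Because the function only needs an iterable of lines, an open file object could be passed in place of sample_lines.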

Answer 2

If we are using regular expressions anyway, and the whole job can be done in a regex, let the regex do the hard work:

>>> regex = r"Document\s+\d+((?:(?!\s*Document\s+\d+)\s*.*)+)"
>>> re.findall(regex, str)

Output

['text1\ntext1\ntex1', 'text2\ntext2\ntext2', 'text3\ntext3\ntext3']

See the live demo here.

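Regex breakdown:

Document\s+\d+ matches the delimiter string
( start of capturing group #1
- (?: start of a non-capturing group
  - (?!\s*Document\s+\d+) if we have not reached the next delimiter
  - \s*.* match the current line
- )+ end of the non-capturing group, repeated as many times as possible
) end of capturing group #1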
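The breakdown above can be tried out with a small self-contained script. This is only a sketch: bodies_from_text, DOC_BLOCK and sample_text are names invented here, and the sample string is a copy of the file from the question. Since \s* sits inside the capturing group, each captured block can carry surrounding whitespace, so the sketch normalizes it with split() and join.

```python
import re

# Same pattern as in the answer above
DOC_BLOCK = re.compile(r"Document\s+\d+((?:(?!\s*Document\s+\d+)\s*.*)+)")

def bodies_from_text(text):
    """Return one space-joined body string per 'Document N' block."""
    # findall yields group 1 of every match; split() + join collapses
    # newlines and any extra spaces into single spaces
    return [" ".join(block.split()) for block in DOC_BLOCK.findall(text)]

# In-memory stand-in for the txt file shown in the question
sample_text = (
    "Document 1\ntext1\ntext1\ntex1\n"
    "Document 2\ntext2\ntext2\ntext2    \n"
    "Document 3\ntext3\ntext3\ntext3\n"
)

bodies = bodies_from_text(sample_text)
print(bodies)
```

Each string in bodies could then be stored under a 'body' key when building the JSON output.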

End a loop when the next document starts (Python 3)
