Question

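Using Ruby, I'm trying to parse some documents in which I need to split the text into blocks, each consisting of a heading followed by an unknown amount of text, and push them into an array: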

SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.

should become

["SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.",
"SECTION 2. ANOTHER HEADING

Another big block of text."]

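I could use string.split(/\n\n\n/), but I'd like something more specific, since I can't guarantee that every section will be followed by two blank lines. A bit more experimenting led me to this: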

string.split(/(?:^|\n)(SECTION.+\n)/).each do |s|
  sections << s
end

But then I have to process the output again to get what I need.
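
For illustration, here is a sketch of what that split returns on the sample text (the `text` and `parts` names are assumed here, they are not in the original snippet). Because of the capture group, each heading comes back as its own element, separate from its body, and there is a leading empty string, which is why a second pass would be needed:

```ruby
text = <<ENDTEXT
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.
ENDTEXT

# The capture group makes split keep the matched heading lines,
# but as separate elements from the text that follows them.
parts = text.split(/(?:^|\n)(SECTION.+\n)/)

p parts.length  # 5 elements rather than the 2 that are wanted
p parts[0]      # "" (empty string before the first heading)
p parts[1]      # "SECTION 1. A HEADING\n"
p parts[3]      # "SECTION 2. ANOTHER HEADING\n"
```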

Is there a way to do this without having to make multiple passes?

Thanks for your help.

Answer 1

You can use String#scan with a multiline-mode regex and a positive lookahead:

text = <<ENDTEXT
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.
ENDTEXT

header = /^SECTION\s+\d+\./
sections = text.scan(/(?m)#{header}.*?(?=#{header}|\Z)/)

puts sections.join("\n---\n")

# =>
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.



---
SECTION 2. ANOTHER HEADING

Another big block of text.
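
As the output above shows, each match keeps the blank lines that trail its section, whereas the array asked for in the question has none. A minimal sketch, assuming it is acceptable to trim surrounding whitespace from every section, is to add `map(&:strip)`. It also uses the `/m` flag and `\z` instead of `(?m)` and `\Z`, which is only a stylistic variation:

```ruby
text = <<ENDTEXT
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.
ENDTEXT

header = /^SECTION\s+\d+\./

# Lazily match from one header up to (not including) the next header,
# or the end of the string, then trim leading/trailing whitespace.
sections = text.scan(/#{header}.*?(?=#{header}|\z)/m).map(&:strip)

p sections.length  # 2
p sections.last    # "SECTION 2. ANOTHER HEADING\n\nAnother big block of text."
```

Blank lines inside a section are preserved; only whitespace at the very start and end of each element is removed.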

Answer 2

String#scan will give you the array you need:

string.scan(/^SECTION(?:(?!SECTION).)*/m)
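
One thing to be aware of: the `(?!SECTION)` lookahead stops a match at any occurrence of the word SECTION, including one in the middle of a body line. The sketch below uses a made-up sample text to show this, together with a stricter variant (an assumption on my part, not part of the original answer) that only treats a line-initial "SECTION <number>." as a boundary:

```ruby
text = <<ENDTEXT
SECTION 1. A HEADING

This body mentions SECTION 2. in passing.

SECTION 2. ANOTHER HEADING

Another big block of text.
ENDTEXT

# Pattern from the answer: the first match is cut short at the inline "SECTION".
loose = text.scan(/^SECTION(?:(?!SECTION).)*/m)
p loose.first  # "SECTION 1. A HEADING\n\nThis body mentions "

# Stricter variant (assumed): a boundary must start a line and look like a heading.
strict = text.scan(/^SECTION\s+\d+\.(?:(?!^SECTION\s+\d+\.).)*/m).map(&:strip)
p strict.first # "SECTION 1. A HEADING\n\nThis body mentions SECTION 2. in passing."
```

If the documents never use the word SECTION inside body text, the shorter pattern is sufficient.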
