Split blocks of text by headings in Ruby

Time: 2012-06-11 23:11:07

Tags: ruby arrays text

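Using Ruby, I'm trying to parse some documents in which I need to split out blocks of text, each consisting of a heading followed by an undetermined amount of text, and push them into an array: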

SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.

should become

["SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.",
"SECTION 2. ANOTHER HEADING

Another big block of text."]

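I could use string.split(/\n\n\n/), but I'd like something more specific, because I can't guarantee that every section will be followed by two blank lines. A bit more experimentation led me to this: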

sections = []
string.split(/(?:^|\n)(SECTION.+\n)/).each do |s|
  sections << s
end

But then I'd have to process the output again to get what I need.
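
To illustrate why a second pass is needed: because the pattern contains a capture group, String#split returns the captured headings as separate elements, so headings and bodies land in alternating slots (plus a leading empty string). The sketch below runs the split on the sample input and shows one possible second pass; the names `pieces` and `sections` and the `each_slice` recombination are only illustrative, not the only way to do it.

```ruby
string = <<~TEXT
  SECTION 1. A HEADING

  Some undetermined length of text,
  which can be multiple lines and paragraphs.


  SECTION 2. ANOTHER HEADING

  Another big block of text.
TEXT

# First pass: the capture group makes split keep each heading as its own element.
pieces = string.split(/(?:^|\n)(SECTION.+\n)/)
pieces.each { |piece| p piece }

# Second pass (illustrative): drop the leading empty string, then glue each
# heading back onto the body that follows it and trim surrounding whitespace.
sections = pieces.drop(1).each_slice(2).map { |heading, body| (heading + body.to_s).strip }
p sections
```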

Is there a way to do this without making multiple passes?

Thanks for your help.

2 answers:

Answer 0 (score: 2)

You can use String#scan with a multiline-mode regular expression and a positive lookahead:

text = <<ENDTEXT
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.


SECTION 2. ANOTHER HEADING

Another big block of text.
ENDTEXT

header = /^SECTION\s+\d+\./
sections = text.scan(/(?m)#{header}.*?(?=#{header}|\Z)/)

puts sections.join("\n---\n")

# =>
SECTION 1. A HEADING

Some undetermined length of text,
which can be multiple lines and paragraphs.



---
SECTION 2. ANOTHER HEADING

Another big block of text.
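
Note that each scanned section keeps the blank lines that follow it, as the output above shows. If the goal is exactly the array from the question, one option (an addition to this answer, not part of it) is to trim each match with String#rstrip. A minimal sketch on the same sample text:

```ruby
text = <<~ENDTEXT
  SECTION 1. A HEADING

  Some undetermined length of text,
  which can be multiple lines and paragraphs.


  SECTION 2. ANOTHER HEADING

  Another big block of text.
ENDTEXT

header = /^SECTION\s+\d+\./

# Same scan as above; rstrip removes the trailing blank lines from each section.
sections = text.scan(/(?m)#{header}.*?(?=#{header}|\Z)/).map(&:rstrip)

p sections
```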

Answer 1 (score: 1)

String#scan will give you the array you need:

string.scan(/^SECTION(?:(?!SECTION).)*/m)
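
One caveat: the negative lookahead stops at any occurrence of "SECTION", including one in the middle of a body line. The sketch below uses made-up sample text to show this, and compares it with a variant whose lookahead is anchored to the start of a line; the anchored variant is a suggestion added here, not part of the original answer.

```ruby
string = <<~TEXT
  SECTION 1. A HEADING

  Some text.

  SECTION 2. ANOTHER HEADING

  This refers back to SECTION 1 in the middle of a line.
TEXT

# Pattern from the answer: a section ends at ANY later "SECTION".
original = string.scan(/^SECTION(?:(?!SECTION).)*/m)

# Hypothetical variant: a section ends only where "SECTION" begins a line.
anchored = string.scan(/^SECTION(?:(?!^SECTION).)*/m)

p original.last  # cut off just before the mid-line "SECTION 1"
p anchored.last  # runs to the end of the text
```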