This is my input text:
1. INTRODUCTION
This is a test document. This document lines can span multiple lines.
This is another line.
2. PROCESS
This is a test process. This is another line.
3. ANOTHER HEADING
...
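I want to extract the text between the main headings 1, 2, 3, and so on. I am using this regular expression to match the headings - ^[ ]{0,2}?[0-9]{0,2}\\.(.*)$
How do I extract the text between the matches?
EDIT
I tried using this code -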
while(matcher.find()) {
}
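If I look ahead to the start index of the next match inside this while loop, it changes the state of the matcher. How can I get the text in between using String.substring? I need to take a substring from the end of the current match to the start of the next match.
Answer 0 (score: 1)
How do I extract the text between the matches?
Do you mean between 1. INTRODUCTION and 2. PROCESS, and so on? If so, read the input line by line: if a line is not a "header" line, append its text to a buffer. If it is a header, add the buffer to a running list and then clear the buffer.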
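Something like this (pseudocode):
List<String> content
currentContent = ""
while line = readNextLine()
    if not matched header
        currentContent += line
    else
        // found a new header: add the buffered content to the list, then clear it
        if currentContent != ""
            content.add(currentContent)
            currentContent = ""
// after the loop, add whatever is still in the buffer (text after the last header)
if currentContent != ""
    content.add(currentContent)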
EDIT: with the input as one big string
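// Needs: java.util.ArrayList, java.util.List, java.util.regex.Matcher
// yourPattern is your compiled header Pattern; yourString is the whole input
// Split the input into lines
String[] bits = yourString.split("\\n");
StringBuilder currentContent = new StringBuilder(); // Text between headers
List<String> content = new ArrayList<String>(); // Running list of text between headers
// Loop through each line
for (String bit : bits) {
    Matcher m = yourPattern.matcher(bit);
    if (m.matches()) {
        // Found a header: store the text collected so far and reset the buffer
        if (currentContent.length() != 0) {
            content.add(currentContent.toString());
            currentContent.setLength(0);
        }
    } else {
        // Not a header: append the line and restore the newline removed by split
        currentContent.append(bit).append('\n');
    }
}
// Store any text that follows the last header
if (currentContent.length() != 0) {
    content.add(currentContent.toString());
}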
Something like this should work. I suppose you could do it with one complex multi-line regex, but this seems easier to me.
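For comparison, here is a minimal sketch of the regex-only route, which may turn out simpler than expected: compile the header pattern with Pattern.MULTILINE and let Pattern.split cut the text at every header line. The class name SplitOnHeaders and the method bodies are made up for illustration. The pattern is also an assumption: it is the question's regex tightened to require at least one digit, so that a line beginning with a bare "." is not treated as a header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitOnHeaders {
    // Assumed header shape: up to two spaces, one or two digits, a dot, rest of the line.
    private static final Pattern HEADER =
            Pattern.compile("^[ ]{0,2}[0-9]{1,2}\\..*$", Pattern.MULTILINE);

    // Returns the non-empty blocks of text found between (and after) header lines.
    static List<String> bodies(String text) {
        List<String> result = new ArrayList<>();
        for (String chunk : HEADER.split(text)) {
            String body = chunk.trim(); // drop the newlines left around each block
            if (!body.isEmpty()) {
                result.add(body);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "1. INTRODUCTION\n"
                + "This is a test document.\n"
                + "This is another line.\n"
                + "2. PROCESS\n"
                + "This is a test process.\n"
                + "3. ANOTHER HEADING\n";
        for (String body : bodies(text)) {
            System.out.println("---");
            System.out.println(body);
        }
    }
}
```

The trade-off is that split throws the header text away, so this only fits if you need the bodies and not the titles.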
Answer 1 (score: 0)
How about this:
String text =
    " 1. INTRODUCTION\n"
    + " This is a test document. This document lines can span multiple lines.\n"
    + " This is another line.\n"
    + " 2. PROCESS\n"
    + " This is a test process. This is another line.\n"
    + " 3. ANOTHER HEADING\n";
Pattern pat = Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);
Matcher m = pat.matcher(text);
int start = 0;
while (m.find()) {
    if (start < m.start()) {
        System.out.println("*** paragraphs:");
        System.out.println(text.substring(start, m.start()));
    }
    System.out.println("*** title:");
    System.out.println(m.group());
    start = m.end();
}
The result is:
*** title:
1. INTRODUCTION
*** paragraphs:
This is a test document. This document lines can span multiple lines.
This is another line.
*** title:
2. PROCESS
*** paragraphs:
This is a test process. This is another line.
*** title:
3. ANOTHER HEADING
You may want to strip the line breaks before and after each paragraph.
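If you would rather collect the sections than print them, the same loop can be adapted as in the sketch below. It remembers the previous title and the end of the previous match, so there is no need to look ahead to the next match: each body is closed when the next header is found, and the last one is closed after the loop. The names SectionExtractor and sections are hypothetical, and trim() is used here to drop the surrounding line breaks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {
    // Header regex taken from the question, compiled in multi-line mode.
    private static final Pattern HEADER =
            Pattern.compile("^[ ]{0,2}?[0-9]{0,2}\\.(.*)$", Pattern.MULTILINE);

    // Maps each header line to the trimmed text that follows it, in document order.
    static Map<String, String> sections(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = HEADER.matcher(text);
        String title = null; // header whose body is currently open
        int bodyStart = 0;   // index just past that header
        while (m.find()) {
            if (title != null) {
                // The previous body ends where this header starts.
                result.put(title, text.substring(bodyStart, m.start()).trim());
            }
            title = m.group().trim();
            bodyStart = m.end();
        }
        if (title != null) {
            // Text after the last header, possibly empty.
            result.put(title, text.substring(bodyStart).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "1. INTRODUCTION\n"
                + "This is a test document.\n"
                + "This is another line.\n"
                + "2. PROCESS\n"
                + "This is a test process. This is another line.\n"
                + "3. ANOTHER HEADING\n";
        for (Map.Entry<String, String> e : sections(text).entrySet()) {
            System.out.println("[" + e.getKey() + "] -> [" + e.getValue() + "]");
        }
    }
}
```

Two caveats with this sketch: any text before the first header is ignored, and two headers with identical text would overwrite each other in the map, so a list of pairs may be the safer choice if titles can repeat.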